.. _unicode-howto:

*****************
  Unicode HOWTO
*****************

:Release: 1.12

This HOWTO discusses Python's support for the Unicode specification
for representing textual data, and explains various problems that
people commonly encounter when trying to work with Unicode.


Introduction to Unicode
=======================

Definitions
-----------

Today's programs need to be able to handle a wide variety of
characters.  Applications are often internationalized to display
messages and output in a variety of user-selectable languages; the
same program might need to output an error message in English, French,
Japanese, Hebrew, or Russian.  Web content can be written in any of
these languages and can also include a variety of emoji symbols.
Python's string type uses the Unicode Standard for representing
characters, which lets Python programs work with all these different
possible characters.

Unicode (https://www.unicode.org/) is a specification that aims to
list every character used by human languages and give each character
its own unique code.  The Unicode specifications are continually
revised and updated to add new languages and symbols.

A **character** is the smallest possible component of a text.  'A', 'B', 'C',
etc., are all different characters.  So are 'È' and 'Í'.  Characters vary
depending on the language or context you're talking
about.  For example, there's a character for "Roman Numeral One", 'Ⅰ', that's
separate from the uppercase letter 'I'.  They'll usually look the same,
but these are two different characters that have different meanings.

The Unicode standard describes how characters are represented by
**code points**.  A code point value is an integer in the range 0 to
0x10FFFF (about 1.1 million values, with some 110 thousand assigned so
far).
In the standard and in this document, a code point is written
using the notation ``U+265E`` to mean the character with value
``0x265e`` (9,822 in decimal).

The Unicode standard contains a lot of tables listing characters and
their corresponding code points:

.. code-block:: none

    0061    'a'; LATIN SMALL LETTER A
    0062    'b'; LATIN SMALL LETTER B
    0063    'c'; LATIN SMALL LETTER C
    ...
    007B    '{'; LEFT CURLY BRACKET
    ...
    2167    'Ⅷ'; ROMAN NUMERAL EIGHT
    2168    'Ⅸ'; ROMAN NUMERAL NINE
    ...
    265E    '♞'; BLACK CHESS KNIGHT
    265F    '♟'; BLACK CHESS PAWN
    ...
    1F600   '😀'; GRINNING FACE
    1F609   '😉'; WINKING FACE
    ...

Strictly, these definitions imply that it's meaningless to say 'this is
character ``U+265E``'.  ``U+265E`` is a code point, which represents some particular
character; in this case, it represents the character 'BLACK CHESS KNIGHT',
'♞'.  In
informal contexts, this distinction between code points and characters will
sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**.  The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used.  Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of
code points, which are numbers from 0 through ``0x10FFFF`` (1,114,111
decimal).  This sequence of code points needs to be represented in
memory as a set of **code units**, and **code units** are then mapped
to 8-bit bytes.  The rules for translating a Unicode string into a
sequence of bytes are called a **character encoding**, or just
an **encoding**.

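To make the distinction concrete, here is a minimal illustration (not part of
the Unicode tables above) showing a single character, the integer code point
behind it, and the different byte sequences that two common encodings produce
for it:

```python
# One character, one code point, several possible byte encodings.
s = "♞"                       # BLACK CHESS KNIGHT
print(hex(ord(s)))            # 0x265e -- the code point
print(s.encode("utf-8"))      # b'\xe2\x99\x9e' -- three UTF-8 bytes
print(s.encode("utf-16-le"))  # b'^&' -- two little-endian UTF-16 bytes
```

The same string, encoded differently, yields entirely different bytes; only
the sequence of code points is "the string" itself.
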
The first encoding you might think of is using 32-bit integers as the
code unit, and then using the CPU's representation of 32-bit integers.
In this representation, the string "Python" might look like this:

.. code-block:: none

       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space.  In most texts, the majority of the code points
   are less than 127, or less than 255, so a lot of space is occupied by ``0x00``
   bytes.  The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation.  Increased RAM usage doesn't matter too much (desktop
   computers have gigabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

Therefore this encoding isn't used very much, and people instead choose other
encodings that are more efficient and convenient, such as UTF-8.

UTF-8 is one of the most commonly used encodings, and Python often
defaults to using it.  UTF stands for "Unicode Transformation Format",
and the '8' means that 8-bit values are used in the encoding.  (There
are also UTF-16 and UTF-32 encodings, but they are less frequently
used than UTF-8.)  UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or
   four bytes, where each byte of the sequence is between 128 and 255.

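These two rules can be checked directly in Python.  The following sketch
encodes one character from each length class (the characters chosen here are
arbitrary examples) and prints how many bytes each needs:

```python
# UTF-8 length classes: 1 byte for ASCII, 2 to 4 bytes for everything else.
for ch in "a", "é", "♞", "😀":
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded!r}")
```

Note that every byte after the first in a multi-byte sequence is in the range
128–255, which is what makes resynchronization possible.
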
UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes containing no embedded zero
   bytes.  This avoids byte-ordering issues, and means UTF-8 strings can be
   processed by C functions such as ``strcpy()`` and sent through protocols that
   can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be
   represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize.  It's also unlikely that
   random 8-bit data will look like valid UTF-8.


References
----------

The `Unicode Consortium site <http://www.unicode.org>`_ has character charts, a
glossary, and PDF versions of the Unicode specification.  Be prepared for some
difficult reading.  `A chronology <http://www.unicode.org/history/>`_ of the
origin and development of Unicode is also available on the site.

On the Computerphile Youtube channel, Tom Scott briefly
`discusses the history of Unicode and UTF-8 <https://www.youtube.com/watch?v=MijmeoH9LT4>`_
(9 minutes 36 seconds).

To help understand the standard, Jukka Korpela has written `an introductory
guide <http://jkorpela.fi/unicode/guide.html>`_ to reading the
Unicode character tables.

Another `good introductory article <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
was written by Joel Spolsky.
If this introduction didn't make things clear to you, you should try
reading this alternate article before continuing.

Wikipedia entries are often helpful; see the entries for "`character encoding
<https://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
<https://en.wikipedia.org/wiki/UTF-8>`_, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.

The String Type
---------------

Since Python 3.0, the language's :class:`str` type contains Unicode
characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
rocks!'``, or the triple-quoted string syntax is stored as Unicode.

The default encoding for Python source code is UTF-8, so you can simply
include a Unicode character in a string literal::

    try:
        with open('/tmp/input.txt', 'r') as f:
            ...
    except OSError:
        # 'File not found' error message.
        print("Fichier non trouvé")

Side note: Python 3 also supports using Unicode characters in identifiers::

    répertoire = "/tmp/records.log"
    with open(répertoire, "w") as f:
        f.write("test\n")

If you can't enter a particular character in your editor or want to
keep the source code ASCII-only for some reason, you can also use
escape sequences in string literals.  (Depending on your system,
you may see the actual capital-delta glyph instead of a \u escape.) ::

    >>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
    '\u0394'
    >>> "\u0394"  # Using a 16-bit hex value
    '\u0394'
    >>> "\U00000394"  # Using a 32-bit hex value
    '\u0394'

In addition, one can create a string using the :func:`~bytes.decode` method of
:class:`bytes`.  This method takes an *encoding* argument, such as ``UTF-8``,
and optionally an *errors* argument.

The *errors* argument specifies the response when the input string can't be
converted according to the encoding's rules.
Legal values for this argument are
``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
``U+FFFD``, ``REPLACEMENT CHARACTER``), ``'ignore'`` (just leave the
character out of the Unicode result), or ``'backslashreplace'`` (inserts a
``\xNN`` escape sequence).
The following examples show the differences::

    >>> b'\x80abc'.decode("utf-8", "strict")  #doctest: +NORMALIZE_WHITESPACE
    Traceback (most recent call last):
        ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
      invalid start byte
    >>> b'\x80abc'.decode("utf-8", "replace")
    '\ufffdabc'
    >>> b'\x80abc'.decode("utf-8", "backslashreplace")
    '\\x80abc'
    >>> b'\x80abc'.decode("utf-8", "ignore")
    'abc'

Encodings are specified as strings containing the encoding's name.  Python
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list.  Some encodings have multiple names; for
example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
the same encoding.

One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point.  The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

    >>> chr(57344)
    '\ue000'
    >>> ord('\ue000')
    57344

Converting to Bytes
-------------------

The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
which returns a :class:`bytes` representation of the Unicode string, encoded in the
requested *encoding*.

The *errors* parameter is the same as the parameter of the
:meth:`~bytes.decode` method but supports a few more possible handlers.
As well as
``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
inserts a question mark instead of the unencodable character), there is
also ``'xmlcharrefreplace'`` (inserts an XML character reference),
``'backslashreplace'`` (inserts a ``\uNNNN`` escape sequence) and
``'namereplace'`` (inserts a ``\N{...}`` escape sequence).

The following example shows the different results::

    >>> u = chr(40960) + 'abcd' + chr(1972)
    >>> u.encode('utf-8')
    b'\xea\x80\x80abcd\xde\xb4'
    >>> u.encode('ascii')  #doctest: +NORMALIZE_WHITESPACE
    Traceback (most recent call last):
        ...
    UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
      position 0: ordinal not in range(128)
    >>> u.encode('ascii', 'ignore')
    b'abcd'
    >>> u.encode('ascii', 'replace')
    b'?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    b'&#40960;abcd&#1972;'
    >>> u.encode('ascii', 'backslashreplace')
    b'\\ua000abcd\\u07b4'
    >>> u.encode('ascii', 'namereplace')
    b'\\N{YI SYLLABLE IT}abcd\\u07b4'

The low-level routines for registering and accessing the available
encodings are found in the :mod:`codecs` module.  Implementing new
encodings also requires understanding the :mod:`codecs` module.
However, the encoding and decoding functions returned by this module
are usually more low-level than is comfortable, and writing new encodings
is a specialized task, so the module won't be covered in this HOWTO.


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, specific Unicode code points can be written using the
``\u`` escape sequence, which is followed by four hex digits giving the code
point.  The ``\U`` escape sequence is similar, but expects eight hex digits,
not four::

    >>> s = "a\xac\u1234\u20ac\U00008000"
    ... #     ^^^^ two-digit hex escape
    ... #         ^^^^^^ four-digit Unicode escape
    ... #                     ^^^^^^^^^^ eight-digit Unicode escape
    >>> [ord(c) for c in s]
    [97, 172, 4660, 8364, 32768]

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language.  You
can also assemble strings using the :func:`chr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding.  You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used.  This is done by including
a special comment as either the first or second line of the source file::

    #!/usr/bin/env python
    # -*- coding: latin-1 -*-

    u = 'abcdé'
    print(ord(u[-1]))

The syntax is inspired by Emacs's notation for specifying variables local to a
file.  Emacs supports many different variables, but Python only supports
'coding'.  The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention.  Python looks for
``coding: name`` or ``coding=name`` in the comment.

If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned.  See also :pep:`263` for more information.


Unicode Properties
------------------

The Unicode specification includes a database of information about
code points.  For each defined code point, the information includes
the character's name, its category, the numeric value if applicable
(for characters representing numeric concepts such as the Roman
numerals, fractions such as one-third and four-fifths, etc.).
There 357are also display-related properties, such as how to use the code point 358in bidirectional text. 359 360The following program displays some information about several characters, and 361prints the numeric value of one particular character:: 362 363 import unicodedata 364 365 u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231) 366 367 for i, c in enumerate(u): 368 print(i, '%04x' % ord(c), unicodedata.category(c), end=" ") 369 print(unicodedata.name(c)) 370 371 # Get numeric value of second character 372 print(unicodedata.numeric(u[1])) 373 374When run, this prints: 375 376.. code-block:: none 377 378 0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE 379 1 0bf2 No TAMIL NUMBER ONE THOUSAND 380 2 0f84 Mn TIBETAN MARK HALANTA 381 3 1770 Lo TAGBANWA LETTER SA 382 4 33af So SQUARE RAD OVER S SQUARED 383 1000.0 384 385The category codes are abbreviations describing the nature of the character. 386These are grouped into categories such as "Letter", "Number", "Punctuation", or 387"Symbol", which in turn are broken up into subcategories. To take the codes 388from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means 389"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol, 390other". See 391`the General Category Values section of the Unicode Character Database documentation <http://www.unicode.org/reports/tr44/#General_Category_Values>`_ for a 392list of category codes. 393 394 395Comparing Strings 396----------------- 397 398Unicode adds some complication to comparing strings, because the same 399set of characters can be represented by different sequences of code 400points. For example, a letter like 'ê' can be represented as a single 401code point U+00EA, or as U+0065 U+0302, which is the code point for 402'e' followed by a code point for 'COMBINING CIRCUMFLEX ACCENT'. These 403will produce the same output when printed, but one is a string of 404length 1 and the other is of length 2. 
One tool for a case-insensitive comparison is the
:meth:`~str.casefold` string method that converts a string to a
case-insensitive form following an algorithm described by the Unicode
Standard.  This algorithm has special handling for characters such as
the German letter 'ß' (code point U+00DF), which becomes the pair of
lowercase letters 'ss'.

::

    >>> street = 'Gürzenichstraße'
    >>> street.casefold()
    'gürzenichstrasse'

A second tool is the :mod:`unicodedata` module's
:func:`~unicodedata.normalize` function that converts strings to one
of several normal forms, where letters followed by a combining
character are replaced with single characters.  :func:`normalize` can
be used to perform string comparisons that won't falsely report
inequality if two strings use combining characters differently:

::

    import unicodedata

    def compare_strs(s1, s2):
        def NFD(s):
            return unicodedata.normalize('NFD', s)

        return NFD(s1) == NFD(s2)

    single_char = 'ê'
    multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
    print('length of first string=', len(single_char))
    print('length of second string=', len(multiple_chars))
    print(compare_strs(single_char, multiple_chars))

When run, this outputs:

.. code-block:: shell-session

    $ python3 compare-strs.py
    length of first string= 1
    length of second string= 2
    True

The first argument to the :func:`~unicodedata.normalize` function is a
string giving the desired normalization form, which can be one of
'NFC', 'NFKC', 'NFD', and 'NFKD'.

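As a quick sketch of how the forms differ: 'NFC' composes an accented letter
into a single code point, 'NFD' keeps it decomposed, and the 'K' forms
additionally apply compatibility mappings, for example turning a Roman numeral
character into ordinary letters:

```python
import unicodedata

s = "e\u0302"  # 'e' + COMBINING CIRCUMFLEX ACCENT
print(len(unicodedata.normalize("NFC", s)))     # 1: composed into U+00EA
print(len(unicodedata.normalize("NFD", s)))     # 2: kept decomposed
print(unicodedata.normalize("NFKC", "\u2168"))  # 'IX': compatibility mapping
```

For comparisons you only need to pick one form and apply it to both sides; NFC
and NFD give equivalent results as long as you are consistent.
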
The Unicode Standard also specifies how to do caseless comparisons::

    import unicodedata

    def compare_caseless(s1, s2):
        def NFD(s):
            return unicodedata.normalize('NFD', s)

        return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())

    # Example usage
    single_char = 'ê'
    multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'

    print(compare_caseless(single_char, multiple_chars))

This will print ``True``.  (Why is :func:`NFD` invoked twice?  Because
there are a few characters that make :meth:`casefold` return a
non-normalized string, so the result needs to be normalized again.  See
section 3.13 of the Unicode Standard for a discussion and an example.)


Unicode Regular Expressions
---------------------------

The regular expressions supported by the :mod:`re` module can be provided
either as bytes or strings.  Some of the special character sequences such as
``\d`` and ``\w`` have different meanings depending on whether
the pattern is supplied as bytes or a string.  For example,
``\d`` will match the characters ``[0-9]`` in bytes but
in strings will match any character that's in the ``'Nd'`` category.

The string in this example has the number 57 written in both Thai and
Arabic numerals::

    import re
    p = re.compile(r'\d+')

    s = "Over \u0e55\u0e57 57 flavours"
    m = p.search(s)
    print(repr(m.group()))

When executed, ``\d+`` will match the Thai numerals and print them
out.  If you supply the :const:`re.ASCII` flag to
:func:`~re.compile`, ``\d+`` will match the substring "57" instead.

Similarly, ``\w`` matches a wide variety of Unicode characters but
only ``[a-zA-Z0-9_]`` in bytes or if :const:`re.ASCII` is supplied,
and ``\s`` will match either Unicode whitespace characters or
``[ \t\n\r\f\v]``.


References
----------

.. comment should these be mentioned earlier, e.g.
   at the start of the "introduction to Unicode" first section?

Some good alternative discussions of Python's Unicode support are:

* `Processing Text Files in Python 3 <http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html>`_, by Nick Coghlan.
* `Pragmatic Unicode <https://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by Ned Batchelder.

The :class:`str` type is described in the Python library reference at
:ref:`textseq`.

The documentation for the :mod:`unicodedata` module.

The documentation for the :mod:`codecs` module.

Marc-André Lemburg gave `a presentation titled "Python and Unicode" (PDF slides)
<https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
EuroPython 2002.  The slides are an excellent overview of the design of Python
2's Unicode features (where the Unicode string type is called ``unicode`` and
literals start with ``u``).


Reading and Writing Unicode Data
================================

Once you've written some code that works with Unicode data, the next problem is
input/output.  How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively.  XML parsers often return Unicode
data, for example.  Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket.  It's possible to do all the work
yourself: open a file, read an 8-bit bytes object from it, and convert the bytes
with ``bytes.decode(encoding)``.  However, the manual approach is not recommended.

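For illustration, here is a minimal sketch of that manual approach (the file
name and contents are arbitrary examples):

```python
import os
import tempfile

# Write some UTF-8 bytes to a scratch file.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write("Fichier non trouvé\n".encode("utf-8"))

# The manual approach: open in binary mode, read raw bytes, decode yourself.
with open(path, "rb") as f:
    raw = f.read()
text = raw.decode("utf-8")
print(repr(text))
```

This works, but every call site has to remember the encoding and the decode
step, which is exactly the burden the text-mode :func:`open` described below
takes off your hands.
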
One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes.  If you want to read the file in arbitrary-sized
chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk.  One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)

The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences.  The work of implementing this has already been
done for you: the built-in :func:`open` function can return a file-like object
that assumes the file's contents are in a specified encoding and accepts Unicode
parameters for methods such as :meth:`~io.TextIOBase.read` and
:meth:`~io.TextIOBase.write`.  This works through :func:`open`\'s *encoding* and
*errors* parameters which are interpreted just like those in :meth:`str.encode`
and :meth:`bytes.decode`.

Reading Unicode from a file is therefore simple::

    with open('unicode.txt', encoding='utf-8') as f:
        for line in f:
            print(repr(line))

It's also possible to open files in update mode, allowing both reading and
writing::

    with open('test', encoding='utf-8', mode='w+') as f:
        f.write('\u4500 blah blah blah\n')
        f.seek(0)
        print(repr(f.readline()[:1]))

The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering.
Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read.  There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.

In some areas, it is also convention to use a "BOM" at the start of UTF-8
encoded files; the name is misleading since UTF-8 is not byte-order dependent.
The mark simply announces that the file is encoded in UTF-8.  For reading such
files, use the 'utf-8-sig' codec to automatically skip the mark if present.


Unicode filenames
-----------------

Most of the operating systems in common use today support filenames
that contain arbitrary Unicode characters.  Usually this is
implemented by converting the Unicode string into some encoding that
varies depending on the system.  Today Python is converging on using
UTF-8: Python on MacOS has used UTF-8 for several versions, and Python
3.6 switched to using UTF-8 on Windows as well.  On Unix systems,
there will only be a filesystem encoding if you've set the ``LANG`` or
``LC_CTYPE`` environment variables; if you haven't, the default
encoding is again UTF-8.

The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother.  When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::

    filename = 'filename\u4500abc'
    with open(filename, 'w') as f:
        f.write('blah\n')

Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.

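A short sketch of this (the directory and file name are arbitrary examples):
a file is created under a non-ASCII name, and :func:`os.stat` is then called
with the same Unicode string, with no manual encoding anywhere:

```python
import os
import tempfile

dirname = tempfile.mkdtemp()
filename = os.path.join(dirname, 'fichier-trouvé.txt')  # arbitrary Unicode name
with open(filename, 'w', encoding='utf-8') as f:
    f.write('blah\n')

# The Unicode filename round-trips through the os module unchanged.
print(os.stat(filename).st_size)
```
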
The :func:`os.listdir` function returns filenames, which raises an issue: should it return
the Unicode version of filenames, or should it return bytes containing
the encoded versions?  :func:`os.listdir` can do both, depending on whether you
provided the directory path as bytes or a Unicode string.  If you pass a
Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing a byte
path will return the filenames as bytes.  For example,
assuming the default filesystem encoding is UTF-8, running the following
program::

    fn = 'filename\u4500abc'
    f = open(fn, 'w')
    f.close()

    import os
    print(os.listdir(b'.'))
    print(os.listdir('.'))

will produce the following output:

.. code-block:: shell-session

    $ python listdir-test.py
    [b'filename\xe4\x94\x80abc', ...]
    ['filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.

Note that on most occasions, you can just stick with using
Unicode with these APIs.  The bytes APIs should only be used on
systems where undecodable file names can be present; that's
pretty much only Unix systems now.


Tips for Writing Unicode-aware Programs
---------------------------------------

This section provides some suggestions on writing software that deals with
Unicode.

The most important tip is:

    Software should only work with Unicode strings internally, decoding the input
    data as soon as possible and encoding the output only at the end.

If you attempt to write processing functions that accept both Unicode and byte
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings.  There is no automatic encoding or decoding: if
you do e.g.
``str + bytes``, a :exc:`TypeError` will be raised.

When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database.  If you're doing
this, be careful to check the decoded string, not the encoded bytes data;
some encodings may have interesting properties, such as not being bijective
or not being fully ASCII-compatible.  This is especially true if the input
data also specifies the encoding, since the attacker can then choose a
clever way to hide malicious text in the encoded bytestream.


Converting Between File Encodings
'''''''''''''''''''''''''''''''''

The :class:`~codecs.StreamRecoder` class can transparently convert between
encodings, taking a stream that returns data in encoding #1
and behaving like a stream returning data in encoding #2.

For example, if you have an input file *f* that's in Latin-1, you
can wrap it with a :class:`~codecs.StreamRecoder` to return bytes encoded in
UTF-8::

    new_f = codecs.StreamRecoder(f,
        # en/decoder: used by read() to encode its results and
        # by write() to decode its input.
        codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

        # reader/writer: used to read and write to the stream.
        codecs.getreader('latin-1'), codecs.getwriter('latin-1') )


Files in an Unknown Encoding
''''''''''''''''''''''''''''

What can you do if you need to make a change to a file, but don't know
the file's encoding?
If you know the encoding is ASCII-compatible and
only want to examine or modify the ASCII parts, you can open the file
with the ``surrogateescape`` error handler::

    with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
        data = f.read()

    # make changes to the string 'data'

    with open(fname + '.new', 'w',
              encoding="ascii", errors="surrogateescape") as f:
        f.write(data)

The ``surrogateescape`` error handler will decode any non-ASCII bytes
as code points in a special range running from U+DC80 to
U+DCFF.  These code points will then turn back into the
same bytes when the ``surrogateescape`` error handler is used to
encode the data and write it back out.


References
----------

One section of `Mastering Python 3 Input/Output
<http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o>`_,
a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.

The `PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
Applications in Python"
<https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
discuss questions of character encodings as well as how to internationalize
and localize an application.  These slides cover Python 2.x only.

`The Guts of Unicode in Python
<http://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_
is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode
representation in Python 3.3.


Acknowledgements
================

The initial draft of this document was written by Andrew Kuchling.
It has since been revised further by Alexander Belopolsky, Georg Brandl,
Andrew Kuchling, and Ezio Melotti.

Thanks to the following people who have noted errors or offered
suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
Lemburg, Martin von Löwis, Terry J. Reedy, Serhiy Storchaka,
Eryk Sun, Chad Whitacre, Graham Wideman.