.. _unicode-howto:

*****************
  Unicode HOWTO
*****************

:Release: 1.12

This HOWTO discusses Python's support for the Unicode specification
for representing textual data, and explains various problems that
people commonly encounter when trying to work with Unicode.


Introduction to Unicode
=======================

Definitions
-----------

Today's programs need to be able to handle a wide variety of
characters.  Applications are often internationalized to display
messages and output in a variety of user-selectable languages; the
same program might need to output an error message in English, French,
Japanese, Hebrew, or Russian.  Web content can be written in any of
these languages and can also include a variety of emoji symbols.
Python's string type uses the Unicode Standard for representing
characters, which lets Python programs work with all these different
possible characters.

Unicode (https://www.unicode.org/) is a specification that aims to
list every character used by human languages and give each character
its own unique code.  The Unicode specifications are continually
revised and updated to add new languages and symbols.

A **character** is the smallest possible component of a text.  'A', 'B', 'C',
etc., are all different characters.  So are 'È' and 'Í'.  Characters vary
depending on the language or context you're talking
about.  For example, there's a character for "Roman Numeral One", 'Ⅰ', that's
separate from the uppercase letter 'I'.  They'll usually look the same,
but these are two different characters that have different meanings.

The Unicode standard describes how characters are represented by
**code points**.  A code point value is an integer in the range 0 to
0x10FFFF (about 1.1 million values, the
`actual number assigned <https://www.unicode.org/versions/latest/#Summary>`_
is less than that).
In the standard and in this document, a code point is written
using the notation ``U+265E`` to mean the character with value
``0x265e`` (9,822 in decimal).

The Unicode standard contains a lot of tables listing characters and
their corresponding code points:

.. code-block:: none

   0061    'a'; LATIN SMALL LETTER A
   0062    'b'; LATIN SMALL LETTER B
   0063    'c'; LATIN SMALL LETTER C
   ...
   007B    '{'; LEFT CURLY BRACKET
   ...
   2167    'Ⅷ'; ROMAN NUMERAL EIGHT
   2168    'Ⅸ'; ROMAN NUMERAL NINE
   ...
   265E    '♞'; BLACK CHESS KNIGHT
   265F    '♟'; BLACK CHESS PAWN
   ...
   1F600   '😀'; GRINNING FACE
   1F609   '😉'; WINKING FACE
   ...

Strictly, these definitions imply that it's meaningless to say 'this is
character ``U+265E``'.  ``U+265E`` is a code point, which represents some
particular character; in this case, it represents the character
'BLACK CHESS KNIGHT', '♞'.  In informal contexts, this distinction
between code points and characters will sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**.  The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used.  Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of
code points, which are numbers from 0 through ``0x10FFFF`` (1,114,111
decimal).  This sequence of code points needs to be represented in
memory as a set of **code units**, and **code units** are then mapped
to 8-bit bytes.  The rules for translating a Unicode string into a
sequence of bytes are called a **character encoding**, or just
an **encoding**.
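
To make the distinction concrete, here is a minimal Python sketch
(anticipating the discussion of Python's Unicode support later in this
HOWTO): a string is a sequence of code points, and an encoding maps it
to bytes.

```python
# A str is a sequence of code points; encode() applies an encoding
# to produce the byte representation.
s = "\N{BLACK CHESS KNIGHT}"
print(hex(ord(s)))        # 0x265e -- the code point
print(s.encode("utf-8"))  # b'\xe2\x99\x9e' -- the UTF-8 bytes
```
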

The first encoding you might think of is using 32-bit integers as the
code unit, and then using the CPU's representation of 32-bit integers.
In this representation, the string "Python" might look like this:

.. code-block:: none

       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space.  In most texts, the majority of the code points
   are less than 127, or less than 255, so a lot of space is occupied by ``0x00``
   bytes.  The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation.  Increased RAM usage doesn't matter too much (desktop
   computers have gigabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

Therefore this encoding isn't used very much, and people instead choose other
encodings that are more efficient and convenient, such as UTF-8.

UTF-8 is one of the most commonly used encodings, and Python often
defaults to using it.  UTF stands for "Unicode Transformation Format",
and the '8' means that 8-bit values are used in the encoding.  (There
are also UTF-16 and UTF-32 encodings, but they are less frequently
used than UTF-8.)  UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or
   four bytes, where each byte of the sequence is between 128 and 255.
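
These rules are easy to check from Python; this small sketch prints the
number of UTF-8 bytes needed as the code point grows:

```python
# One byte for ASCII; two, three, or four bytes for higher code points.
for ch in ("a", "é", "€", "😀"):
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")
```
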

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes that contains embedded
   zero bytes only where they represent the null character (U+0000).  This means
   that UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent
   through protocols that can't handle zero bytes for anything other than
   end-of-string markers.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be
   represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize.  It's also unlikely that
   random 8-bit data will look like valid UTF-8.
6. UTF-8 is a byte oriented encoding.  The encoding specifies that each
   character is represented by a specific sequence of one or more bytes.  This
   avoids the byte-ordering issues that can occur with integer and word oriented
   encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending
   on the hardware on which the string was encoded.


References
----------

The `Unicode Consortium site <https://www.unicode.org>`_ has character charts, a
glossary, and PDF versions of the Unicode specification.  Be prepared for some
difficult reading.  `A chronology <https://www.unicode.org/history/>`_ of the
origin and development of Unicode is also available on the site.

On the Computerphile Youtube channel, Tom Scott briefly
`discusses the history of Unicode and UTF-8 <https://www.youtube.com/watch?v=MijmeoH9LT4>`_
(9 minutes 36 seconds).

To help understand the standard, Jukka Korpela has written `an introductory
guide <http://jkorpela.fi/unicode/guide.html>`_ to reading the
Unicode character tables.

Another `good introductory article <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
was written by Joel Spolsky.
If this introduction didn't make things clear to you, you should try
reading this alternate article before continuing.

Wikipedia entries are often helpful; see the entries for "`character encoding
<https://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
<https://en.wikipedia.org/wiki/UTF-8>`_, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.

The String Type
---------------

Since Python 3.0, the language's :class:`str` type contains Unicode
characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
rocks!'``, or the triple-quoted string syntax is stored as Unicode.

The default encoding for Python source code is UTF-8, so you can simply
include a Unicode character in a string literal::

   try:
       with open('/tmp/input.txt', 'r') as f:
           ...
   except OSError:
       # 'File not found' error message.
       print("Fichier non trouvé")

Side note: Python 3 also supports using Unicode characters in identifiers::

   répertoire = "/tmp/records.log"
   with open(répertoire, "w") as f:
       f.write("test\n")

If you can't enter a particular character in your editor or want to
keep the source code ASCII-only for some reason, you can also use
escape sequences in string literals.  (Depending on your system,
you may see the actual capital-delta glyph instead of a \u escape.)
::

   >>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
   '\u0394'
   >>> "\u0394"                          # Using a 16-bit hex value
   '\u0394'
   >>> "\U00000394"                      # Using a 32-bit hex value
   '\u0394'

In addition, one can create a string using the :func:`~bytes.decode` method of
:class:`bytes`.  This method takes an *encoding* argument, such as ``UTF-8``,
and optionally an *errors* argument.

The *errors* argument specifies the response when the input string can't be
converted according to the encoding's rules.  Legal values for this argument are
``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
``U+FFFD``, ``REPLACEMENT CHARACTER``), ``'ignore'`` (just leave the
character out of the Unicode result), or ``'backslashreplace'`` (inserts a
``\xNN`` escape sequence).
The following examples show the differences::

    >>> b'\x80abc'.decode("utf-8", "strict")  #doctest: +NORMALIZE_WHITESPACE
    Traceback (most recent call last):
        ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
      invalid start byte
    >>> b'\x80abc'.decode("utf-8", "replace")
    '\ufffdabc'
    >>> b'\x80abc'.decode("utf-8", "backslashreplace")
    '\\x80abc'
    >>> b'\x80abc'.decode("utf-8", "ignore")
    'abc'

Encodings are specified as strings containing the encoding's name.  Python
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list.  Some encodings have multiple names; for
example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
the same encoding.

One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point.
The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

   >>> chr(57344)
   '\ue000'
   >>> ord('\ue000')
   57344

Converting to Bytes
-------------------

The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
which returns a :class:`bytes` representation of the Unicode string, encoded in the
requested *encoding*.

The *errors* parameter is the same as the parameter of the
:meth:`~bytes.decode` method but supports a few more possible handlers.  As well as
``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
inserts a question mark instead of the unencodable character), there is
also ``'xmlcharrefreplace'`` (inserts an XML character reference),
``'backslashreplace'`` (inserts a ``\uNNNN`` escape sequence) and
``'namereplace'`` (inserts a ``\N{...}`` escape sequence).

The following example shows the different results::

    >>> u = chr(40960) + 'abcd' + chr(1972)
    >>> u.encode('utf-8')
    b'\xea\x80\x80abcd\xde\xb4'
    >>> u.encode('ascii')  #doctest: +NORMALIZE_WHITESPACE
    Traceback (most recent call last):
        ...
    UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
      position 0: ordinal not in range(128)
    >>> u.encode('ascii', 'ignore')
    b'abcd'
    >>> u.encode('ascii', 'replace')
    b'?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    b'&#40960;abcd&#1972;'
    >>> u.encode('ascii', 'backslashreplace')
    b'\\ua000abcd\\u07b4'
    >>> u.encode('ascii', 'namereplace')
    b'\\N{YI SYLLABLE IT}abcd\\u07b4'

The low-level routines for registering and accessing the available
encodings are found in the :mod:`codecs` module.  Implementing new
encodings also requires understanding the :mod:`codecs` module.
However, the encoding and decoding functions returned by this module
are usually more low-level than is comfortable, and writing new encodings
is a specialized task, so the module won't be covered in this HOWTO.


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, specific Unicode code points can be written using the
``\u`` escape sequence, which is followed by four hex digits giving the code
point.  The ``\U`` escape sequence is similar, but expects eight hex digits,
not four::

   >>> s = "a\xac\u1234\u20ac\U00008000"
   ... #     ^^^^ two-digit hex escape
   ... #         ^^^^^^ four-digit Unicode escape
   ... #                     ^^^^^^^^^^ eight-digit Unicode escape
   >>> [ord(c) for c in s]
   [97, 172, 4660, 8364, 32768]

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language.  You
can also assemble strings using the :func:`chr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding.  You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used.  This is done by including
a special comment as either the first or second line of the source file::

   #!/usr/bin/env python
   # -*- coding: latin-1 -*-

   u = 'abcdé'
   print(ord(u[-1]))

The syntax is inspired by Emacs's notation for specifying variables local to a
file.  Emacs supports many different variables, but Python only supports
'coding'.
The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention.  Python looks for
``coding: name`` or ``coding=name`` in the comment.

If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned.  See also :pep:`263` for more information.


Unicode Properties
------------------

The Unicode specification includes a database of information about
code points.  For each defined code point, the information includes
the character's name, its category, the numeric value if applicable
(for characters representing numeric concepts such as the Roman
numerals, fractions such as one-third and four-fifths, etc.).  There
are also display-related properties, such as how to use the code point
in bidirectional text.

The following program displays some information about several characters, and
prints the numeric value of one particular character::

   import unicodedata

   u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

   for i, c in enumerate(u):
       print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
       print(unicodedata.name(c))

   # Get numeric value of second character
   print(unicodedata.numeric(u[1]))

When run, this prints:

.. code-block:: none

   0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
   1 0bf2 No TAMIL NUMBER ONE THOUSAND
   2 0f84 Mn TIBETAN MARK HALANTA
   3 1770 Lo TAGBANWA LETTER SA
   4 33af So SQUARE RAD OVER S SQUARED
   1000.0

The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories.  To take the codes
from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other".
See
`the General Category Values section of the Unicode Character Database documentation <https://www.unicode.org/reports/tr44/#General_Category_Values>`_ for a
list of category codes.


Comparing Strings
-----------------

Unicode adds some complication to comparing strings, because the same
set of characters can be represented by different sequences of code
points.  For example, a letter like 'ê' can be represented as a single
code point U+00EA, or as U+0065 U+0302, which is the code point for
'e' followed by a code point for 'COMBINING CIRCUMFLEX ACCENT'.  These
will produce the same output when printed, but one is a string of
length 1 and the other is of length 2.

One tool for a case-insensitive comparison is the
:meth:`~str.casefold` string method that converts a string to a
case-insensitive form following an algorithm described by the Unicode
Standard.  This algorithm has special handling for characters such as
the German letter 'ß' (code point U+00DF), which becomes the pair of
lowercase letters 'ss'.

::

   >>> street = 'Gürzenichstraße'
   >>> street.casefold()
   'gürzenichstrasse'

A second tool is the :mod:`unicodedata` module's
:func:`~unicodedata.normalize` function that converts strings to one
of several normal forms, where letters followed by a combining
character are replaced with single characters.
:func:`normalize` can
be used to perform string comparisons that won't falsely report
inequality if two strings use combining characters differently:

::

   import unicodedata

   def compare_strs(s1, s2):
       def NFD(s):
           return unicodedata.normalize('NFD', s)

       return NFD(s1) == NFD(s2)

   single_char = 'ê'
   multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
   print('length of first string=', len(single_char))
   print('length of second string=', len(multiple_chars))
   print(compare_strs(single_char, multiple_chars))

When run, this outputs:

.. code-block:: shell-session

   $ python3 compare-strs.py
   length of first string= 1
   length of second string= 2
   True

The first argument to the :func:`~unicodedata.normalize` function is a
string giving the desired normalization form, which can be one of
'NFC', 'NFKC', 'NFD', and 'NFKD'.

The Unicode Standard also specifies how to do caseless comparisons::

   import unicodedata

   def compare_caseless(s1, s2):
       def NFD(s):
           return unicodedata.normalize('NFD', s)

       return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())

   # Example usage
   single_char = 'ê'
   multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'

   print(compare_caseless(single_char, multiple_chars))

This will print ``True``.  (Why is :func:`NFD` invoked twice?  Because
there are a few characters that make :meth:`casefold` return a
non-normalized string, so the result needs to be normalized again.  See
section 3.13 of the Unicode Standard for a discussion and an example.)


Unicode Regular Expressions
---------------------------

The regular expressions supported by the :mod:`re` module can be provided
either as bytes or strings.
Some of the special character sequences such as
``\d`` and ``\w`` have different meanings depending on whether
the pattern is supplied as bytes or a string.  For example,
``\d`` will match the characters ``[0-9]`` in bytes but
in strings will match any character that's in the ``'Nd'`` category.

The string in this example has the number 57 written in both Thai and
Arabic numerals::

   import re
   p = re.compile(r'\d+')

   s = "Over \u0e55\u0e57 57 flavours"
   m = p.search(s)
   print(repr(m.group()))

When executed, ``\d+`` will match the Thai numerals and print them
out.  If you supply the :const:`re.ASCII` flag to
:func:`~re.compile`, ``\d+`` will match the substring "57" instead.

Similarly, ``\w`` matches a wide variety of Unicode characters but
only ``[a-zA-Z0-9_]`` in bytes or if :const:`re.ASCII` is supplied,
and ``\s`` will match either Unicode whitespace characters or
``[ \t\n\r\f\v]``.


References
----------

.. comment should these be mentioned earlier, e.g. at the start of the "introduction to Unicode" first section?

Some good alternative discussions of Python's Unicode support are:

* `Processing Text Files in Python 3 <http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html>`_, by Nick Coghlan.
* `Pragmatic Unicode <https://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by Ned Batchelder.

The :class:`str` type is described in the Python library reference at
:ref:`textseq`.

The documentation for the :mod:`unicodedata` module.

The documentation for the :mod:`codecs` module.

Marc-André Lemburg gave `a presentation titled "Python and Unicode" (PDF slides)
<https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
EuroPython 2002.
The slides are an excellent overview of the design of Python
2's Unicode features (where the Unicode string type is called ``unicode`` and
literals start with ``u``).


Reading and Writing Unicode Data
================================

Once you've written some code that works with Unicode data, the next problem is
input/output.  How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively.  XML parsers often return Unicode
data, for example.  Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket.  It's possible to do all the work
yourself: open a file, read an 8-bit bytes object from it, and convert the bytes
with ``bytes.decode(encoding)``.  However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes.  If you want to read the file in arbitrary-sized
chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk.  One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)
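
The chunk-boundary problem can be seen with the incremental decoder
interface in the :mod:`codecs` module, which buffers a partial
multi-byte sequence instead of failing; this sketch splits the two
UTF-8 bytes of 'é' across two "chunks":

```python
import codecs

# 'é' encodes to two UTF-8 bytes; pretend each arrives in its own chunk.
chunk1, chunk2 = b"\xc3", b"\xa9"

decoder = codecs.getincrementaldecoder("utf-8")()
print(repr(decoder.decode(chunk1)))  # '' -- the partial sequence is buffered
print(repr(decoder.decode(chunk2)))  # 'é' -- completed by the next chunk
```
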

The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences.  The work of implementing this has already been
done for you: the built-in :func:`open` function can return a file-like object
that assumes the file's contents are in a specified encoding and accepts Unicode
parameters for methods such as :meth:`~io.TextIOBase.read` and
:meth:`~io.TextIOBase.write`.  This works through :func:`open`\'s *encoding* and
*errors* parameters which are interpreted just like those in :meth:`str.encode`
and :meth:`bytes.decode`.

Reading Unicode from a file is therefore simple::

   with open('unicode.txt', encoding='utf-8') as f:
       for line in f:
           print(repr(line))

It's also possible to open files in update mode, allowing both reading and
writing::

   with open('test', encoding='utf-8', mode='w+') as f:
       f.write('\u4500 blah blah blah\n')
       f.seek(0)
       print(repr(f.readline()[:1]))

The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering.  Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read.  There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.

In some areas, it is also convention to use a "BOM" at the start of UTF-8
encoded files; the name is misleading since UTF-8 is not byte-order dependent.
The mark simply announces that the file is encoded in UTF-8.  For reading such
files, use the 'utf-8-sig' codec to automatically skip the mark if present.
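
A minimal sketch of the 'utf-8-sig' behaviour, working on bytes objects
directly rather than a file:

```python
# Encoding with 'utf-8-sig' prepends the UTF-8 signature bytes;
# decoding with it strips the signature if present.
data = "Unicode".encode("utf-8-sig")
print(data[:3])                        # b'\xef\xbb\xbf' -- the "BOM" bytes
print(data.decode("utf-8-sig"))        # 'Unicode' -- mark silently dropped
print(b"Unicode".decode("utf-8-sig"))  # 'Unicode' -- no mark is also fine
```
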

Unicode filenames
-----------------

Most of the operating systems in common use today support filenames
that contain arbitrary Unicode characters.  Usually this is
implemented by converting the Unicode string into some encoding that
varies depending on the system.  Today Python is converging on using
UTF-8: Python on macOS has used UTF-8 for several versions, and Python
3.6 switched to using UTF-8 on Windows as well.  On Unix systems,
there will only be a :term:`filesystem encoding <filesystem encoding and error
handler>` if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
you haven't, the default encoding is again UTF-8.

The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother.  When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::

   filename = 'filename\u4500abc'
   with open(filename, 'w') as f:
       f.write('blah\n')

Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.

The :func:`os.listdir` function returns filenames, which raises an issue: should it return
the Unicode version of filenames, or should it return bytes containing
the encoded versions?  :func:`os.listdir` can do both, depending on whether you
provided the directory path as bytes or a Unicode string.  If you pass a
Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing a byte
path will return the filenames as bytes.
For example,
assuming the default :term:`filesystem encoding <filesystem encoding and error
handler>` is UTF-8, running the following program::

   fn = 'filename\u4500abc'
   f = open(fn, 'w')
   f.close()

   import os
   print(os.listdir(b'.'))
   print(os.listdir('.'))

will produce the following output:

.. code-block:: shell-session

   $ python listdir-test.py
   [b'filename\xe4\x94\x80abc', ...]
   ['filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.

Note that on most occasions, you can just stick with using
Unicode with these APIs.  The bytes APIs should only be used on
systems where undecodable file names can be present; that's
pretty much only Unix systems now.


Tips for Writing Unicode-aware Programs
---------------------------------------

This section provides some suggestions on writing software that deals with
Unicode.

The most important tip is:

   Software should only work with Unicode strings internally, decoding the input
   data as soon as possible and encoding the output only at the end.

If you attempt to write processing functions that accept both Unicode and byte
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings.  There is no automatic encoding or decoding: if
you do e.g. ``str + bytes``, a :exc:`TypeError` will be raised.

When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database.  If you're doing
this, be careful to check the decoded string, not the encoded bytes data;
some encodings may have interesting properties, such as not being bijective
or not being fully ASCII-compatible.
This is especially true if the input
data also specifies the encoding, since the attacker can then choose a
clever way to hide malicious text in the encoded bytestream.


Converting Between File Encodings
'''''''''''''''''''''''''''''''''

The :class:`~codecs.StreamRecoder` class can transparently convert between
encodings, taking a stream that returns data in encoding #1
and behaving like a stream returning data in encoding #2.

For example, if you have an input file *f* that's in Latin-1, you
can wrap it with a :class:`~codecs.StreamRecoder` to return bytes encoded in
UTF-8::

   new_f = codecs.StreamRecoder(f,
       # en/decoder: used by read() to encode its results and
       # by write() to decode its input.
       codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

       # reader/writer: used to read and write to the stream.
       codecs.getreader('latin-1'), codecs.getwriter('latin-1') )


Files in an Unknown Encoding
''''''''''''''''''''''''''''

What can you do if you need to make a change to a file, but don't know
the file's encoding?  If you know the encoding is ASCII-compatible and
only want to examine or modify the ASCII parts, you can open the file
with the ``surrogateescape`` error handler::

   with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
       data = f.read()

   # make changes to the string 'data'

   with open(fname + '.new', 'w',
             encoding="ascii", errors="surrogateescape") as f:
       f.write(data)

The ``surrogateescape`` error handler will decode any non-ASCII bytes
as code points in a special range running from U+DC80 to
U+DCFF.  These code points will then turn back into the
same bytes when the ``surrogateescape`` error handler is used to
encode the data and write it back out.
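
The round-trip property can be demonstrated on a bytes object directly,
without needing a file:

```python
# A byte string that is not valid ASCII (or UTF-8).
raw = b"abc\x80\xffdef"

# Non-ASCII bytes become lone surrogates in the U+DC80..U+DCFF range.
text = raw.decode("ascii", errors="surrogateescape")
print(repr(text))  # 'abc\udc80\udcffdef'

# Encoding with the same handler restores the original bytes exactly.
print(text.encode("ascii", errors="surrogateescape") == raw)  # True
```
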

References
----------

One section of `Mastering Python 3 Input/Output
<http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o>`_,
a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.

The `PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
Applications in Python"
<https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
discuss questions of character encodings as well as how to internationalize
and localize an application.  These slides cover Python 2.x only.

`The Guts of Unicode in Python
<http://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_
is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode
representation in Python 3.3.


Acknowledgements
================

The initial draft of this document was written by Andrew Kuchling.
It has since been revised further by Alexander Belopolsky, Georg Brandl,
Andrew Kuchling, and Ezio Melotti.

Thanks to the following people who have noted errors or offered
suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
Lemburg, Martin von Löwis, Terry J. Reedy, Serhiy Storchaka,
Eryk Sun, Chad Whitacre, Graham Wideman.