:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry which
manages the codec and error handling lookup process.

It defines the following functions:

.. function:: encode(obj, [encoding[, errors]])

   Encodes *obj* using the codec registered for *encoding*. The default
   encoding is ``'ascii'``.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'`` meaning that encoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeEncodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

   .. versionadded:: 2.4

.. function:: decode(obj, [encoding[, errors]])

   Decodes *obj* using the codec registered for *encoding*. The default
   encoding is ``'ascii'``.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'`` meaning that decoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeDecodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

   .. versionadded:: 2.4

.. function:: register(search_function)

   Register a codec search function.
   Search functions are expected to take one
   argument, the encoding name in all lower case letters, and return a
   :class:`CodecInfo` object having the following attributes:

   * ``name`` The name of the encoding;

   * ``encode`` The stateless encoding function;

   * ``decode`` The stateless decoding function;

   * ``incrementalencoder`` An incremental encoder class or factory function;

   * ``incrementaldecoder`` An incremental decoder class or factory function;

   * ``streamwriter`` A stream writer class or factory function;

   * ``streamreader`` A stream reader class or factory function.

   The various functions or classes take the following arguments:

   *encode* and *decode*: These must be functions or methods which have the same
   interface as the :meth:`~Codec.encode`/:meth:`~Codec.decode` methods of Codec
   instances (see :ref:`Codec Interface <codec-objects>`). The functions/methods
   are expected to work in a stateless mode.

   *incrementalencoder* and *incrementaldecoder*: These have to be factory
   functions providing the following interface:

      ``factory(errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
   respectively. Incremental codecs can maintain state.

   *streamreader* and *streamwriter*: These have to be factory functions providing
   the following interface:

      ``factory(stream, errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively.
   Stream codecs can maintain state.
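
The search function contract described above can be sketched as follows. This
is only an illustration: the encoding name ``my_utf8`` and the function name
``alias_search`` are made up for the example, and the sketch simply reuses the
components of the built-in UTF-8 codec instead of implementing a new one. The
``u''`` and ``b''`` literals are used on the assumption that the snippet should
run on both Python 2.6+ and Python 3.3+.

```python
import codecs


def alias_search(encoding_name):
    """Hypothetical search function answering to a single made-up name."""
    if encoding_name != "my_utf8":
        # Unknown name: return None so the registry tries other functions.
        return None
    utf8 = codecs.lookup("utf-8")
    # Reuse the UTF-8 components; a real codec would supply its own.
    return codecs.CodecInfo(
        name="my_utf8",
        encode=utf8.encode,
        decode=utf8.decode,
        incrementalencoder=utf8.incrementalencoder,
        incrementaldecoder=utf8.incrementaldecoder,
        streamreader=utf8.streamreader,
        streamwriter=utf8.streamwriter,
    )


codecs.register(alias_search)

info = codecs.lookup("my_utf8")
# The stateless encoder returns (output object, length consumed).
encoded, consumed = info.encode(u"caf\xe9")
print(info.name, repr(encoded), consumed)
```

Registered search functions stay in place for the lifetime of the interpreter,
so a sketch like this is best tried in a throwaway session.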

   Possible values for errors are

   * ``'strict'``: raise an exception in case of an encoding error
   * ``'replace'``: replace malformed data with a suitable replacement marker,
     such as ``'?'`` or ``'\ufffd'``
   * ``'ignore'``: ignore malformed data and continue without further notice
   * ``'xmlcharrefreplace'``: replace with the appropriate XML character
     reference (for encoding only)
   * ``'backslashreplace'``: replace with backslashed escape sequences (for
     encoding only)

   as well as any other error handling name defined via :func:`register_error`.

   In case a search function cannot find a given encoding, it should return
   ``None``.


.. function:: lookup(encoding)

   Looks up the codec info in the Python codec registry and returns a
   :class:`CodecInfo` object as defined above.

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.

To simplify access to the various codecs, the module provides these additional
functions which use :func:`lookup` for the codec lookup:


.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.
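
As a rough illustration of how the returned factory might be used, the sketch
below encodes a text in three chunks. UTF-16 is chosen here (an assumption of
the example, any codec with an incremental encoder would do) because its byte
order mark is emitted only once, which makes the encoder's state visible. The
chunk contents and variable names are invented for the example.

```python
import codecs

# Look up the factory, then create an encoder instance from it.
make_encoder = codecs.getincrementalencoder("utf-16")
encoder = make_encoder(errors="strict")

chunks = [u"Gr\xfc", u"\xdfe ", u"Welt"]

# Encode chunk by chunk; the encoder remembers that the BOM was written.
pieces = [encoder.encode(chunk) for chunk in chunks]
# Signal the end of the input so that any pending state is flushed.
pieces.append(encoder.encode(u"", final=True))

incremental = b"".join(pieces)
one_shot = u"".join(chunks).encode("utf-16")
print(incremental == one_shot)
```

The joined pieces are expected to match the result of encoding the whole text
in one call, with the byte order mark appearing only at the start.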

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.

   .. versionadded:: 2.5


.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.

   .. versionadded:: 2.5


.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its StreamReader class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its StreamWriter class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   *error_handler* will be called during encoding and decoding in case of an error,
   when *name* is specified as the errors parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The error
   handler must either raise this or a different exception or return a tuple with a
   replacement for the unencodable part of the input and a position where encoding
   should continue. The encoder will encode the replacement and continue encoding
   the original input at the specified position. Negative position values will be
   treated as being relative to the end of the input string. If the resulting
   position is out of bounds, an :exc:`IndexError` will be raised.
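
The encoding case just described could look like the following sketch. The
handler name ``'bracket'`` and the function ``bracket_errors`` are invented for
the example; the handler replaces each unencodable character with a bracketed
code point label and tells the encoder to resume right after the failing span.

```python
import codecs


def bracket_errors(exc):
    """Hypothetical handler: replace unencodable text with [U+XXXX] labels."""
    if not isinstance(exc, UnicodeEncodeError):
        # This sketch only deals with encoding errors.
        raise exc
    # exc.object is the input; exc.start and exc.end delimit the bad span.
    bad = exc.object[exc.start:exc.end]
    replacement = u"".join(u"[U+%04X]" % ord(ch) for ch in bad)
    # Return the replacement and the position where encoding should continue.
    return replacement, exc.end


codecs.register_error("bracket", bracket_errors)

result = u"price: 5\u20ac".encode("ascii", "bracket")
print(repr(result))
```

Because the replacement is itself passed through the encoder, it has to be
encodable in the target encoding, which is why the label uses ASCII only.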

   Decoding and translating work similarly, except that :exc:`UnicodeDecodeError`
   or :exc:`UnicodeTranslateError` will be passed to the handler and the
   replacement from the error handler will be put into the output directly.


.. function:: lookup_error(name)

   Return the error handler previously registered under the name *name*.

   Raises a :exc:`LookupError` in case the handler cannot be found.


.. function:: strict_errors(exception)

   Implements the ``strict`` error handling: each encoding or decoding error
   raises a :exc:`UnicodeError`.


.. function:: replace_errors(exception)

   Implements the ``replace`` error handling: malformed data is replaced with a
   suitable replacement character such as ``'?'`` in bytestrings and
   ``'\ufffd'`` in Unicode strings.


.. function:: ignore_errors(exception)

   Implements the ``ignore`` error handling: malformed data is ignored and
   encoding or decoding is continued without further notice.


.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``xmlcharrefreplace`` error handling (for encoding only): the
   unencodable character is replaced by an appropriate XML character reference.


.. function:: backslashreplace_errors(exception)

   Implements the ``backslashreplace`` error handling (for encoding only): the
   unencodable character is replaced by a backslashed escape sequence.

To simplify working with encoded files or streams, the module also defines these
utility functions:


.. function:: open(filename, mode[, encoding[, errors[, buffering]]])

   Open an encoded file using the given *mode* and return a wrapped version
   providing transparent encoding/decoding. The default file mode is ``'r'``
   meaning to open the file in read mode.

   .. note::

      The wrapped version will only accept the object format defined by the codecs,
      i.e.
      Unicode objects for most built-in codecs. Output is also codec-dependent
      and will usually be Unicode as well.

   .. note::

      Files are always opened in binary mode, even if no binary mode was
      specified. This is done to avoid data loss due to encodings using 8-bit
      values. This means that no automatic conversion of ``'\n'`` is done
      on reading and writing.

   *encoding* specifies the encoding which is to be used for the file.

   *errors* may be given to define the error handling. It defaults to ``'strict'``
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function. It
   defaults to line buffered.


.. function:: EncodedFile(file, input[, output[, errors]])

   Return a wrapped version of file which provides transparent encoding
   translation.

   Strings written to the wrapped file are interpreted according to the given
   *input* encoding and then written to the original file as strings using the
   *output* encoding. The intermediate encoding will usually be Unicode but depends
   on the specified codecs.

   If *output* is not given, it defaults to *input*.

   *errors* may be given to define the error handling. It defaults to ``'strict'``,
   which causes :exc:`ValueError` to be raised in case an encoding error occurs.


.. function:: iterencode(iterable, encoding[, errors])

   Uses an incremental encoder to iteratively encode the input provided by
   *iterable*. This function is a :term:`generator`. *errors* (as well as any
   other keyword argument) is passed through to the incremental encoder.

   .. versionadded:: 2.5


.. function:: iterdecode(iterable, encoding[, errors])

   Uses an incremental decoder to iteratively decode the input provided by
   *iterable*. This function is a :term:`generator`.
   *errors* (as well as any
   other keyword argument) is passed through to the incremental decoder.

   .. versionadded:: 2.5

The module also provides the following constants which are useful for reading
and writing to platform dependent files:


.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various encodings of the Unicode byte order mark (BOM)
   used in UTF-16 and UTF-32 data streams to indicate the byte order used in the
   stream or file and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.


.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interface and can also be used to easily write your own codecs for use in
Python.

Each codec has to define four interfaces to make it usable as a codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream reader and writers typically reuse the stateless encoder/decoder to
implement the file protocols.

The :class:`Codec` class defines the interface for stateless encoders/decoders.

To simplify and standardize error handling, the :meth:`~Codec.encode` and
:meth:`~Codec.decode` methods may implement different error handling schemes by
providing the *errors* string argument. The following string values are defined
and implemented by all standard Python codecs:

.. tabularcolumns:: |l|L|

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
|                         | this is the default.                          |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the character and continue with the    |
|                         | next.                                         |
+-------------------------+-----------------------------------------------+
| ``'replace'``           | Replace with a suitable replacement           |
|                         | character; Python will use the official       |
|                         | U+FFFD REPLACEMENT CHARACTER for the built-in |
|                         | Unicode codecs on decoding and '?' on         |
|                         | encoding.                                     |
+-------------------------+-----------------------------------------------+
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
|                         | reference (only for encoding).                |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences     |
|                         | (only for encoding).                          |
+-------------------------+-----------------------------------------------+

The set of allowed values can be extended via :func:`register_error`.


.. _codec-objects:

Codec Objects
^^^^^^^^^^^^^

The :class:`Codec` class defines these methods which also define the function
interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input[, errors])

   Encodes the object *input* and returns a tuple (output object, length consumed).
   While codecs are not restricted to use with Unicode, in a Unicode context,
   encoding converts a Unicode object to a plain string using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method must not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.


.. method:: Codec.decode(input[, errors])

   Decodes the object *input* and returns a tuple (output object, length consumed).
   In a Unicode context, decoding converts a plain string encoded using a
   particular character set encoding to a Unicode object.

   *input* must be an object which provides the ``bf_getreadbuf`` buffer slot.
   Python strings, buffer objects and memory mapped files are examples of objects
   providing this slot.

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method must not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.

The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
with multiple calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
the incremental encoder/decoder. The incremental encoder/decoder keeps track of
the encoding/decoding process during method calls.

The joined output of calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
the same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.


.. _incremental-encoder-objects:

IncrementalEncoder Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. versionadded:: 2.5

The :class:`IncrementalEncoder` class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalEncoder([errors])

   Constructor for an :class:`IncrementalEncoder` instance.

   All incremental encoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalEncoder` may implement different error handling schemes
   by providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.

   * ``'backslashreplace'`` Replace with backslashed escape sequences.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalEncoder`
   object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: encode(object[, final])

      Encodes *object* (taking the current state of the encoder into account)
      and returns the resulting encoded object. If this is the last call to
      :meth:`encode`, *final* must be true (the default is false).


   .. method:: reset()

      Reset the encoder to the initial state.


.. _incremental-decoder-objects:

IncrementalDecoder Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalDecoder` class is used for decoding an input in multiple
steps. It defines the following methods which every incremental decoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalDecoder([errors])

   Constructor for an :class:`IncrementalDecoder` instance.

   All incremental decoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalDecoder` may implement different error handling schemes
   by providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalDecoder`
   object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: decode(object[, final])

      Decodes *object* (taking the current state of the decoder into account)
      and returns the resulting decoded object. If this is the last call to
      :meth:`decode`, *final* must be true (the default is false). If *final* is
      true the decoder must decode the input completely and must flush all
      buffers. If this isn't possible (e.g. because of incomplete byte sequences
      at the end of the input) it must initiate error handling just like in the
      stateless case (which might raise an exception).


   .. method:: reset()

      Reset the decoder to the initial state.


The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
working interfaces which can be used to implement new encoding submodules very
easily. See :mod:`encodings.utf_8` for an example of how this is done.


.. _stream-writer-objects:

StreamWriter Objects
^^^^^^^^^^^^^^^^^^^^

The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
following methods which every stream writer must define in order to be
compatible with the Python codec registry.


.. class:: StreamWriter(stream[, errors])

   Constructor for a :class:`StreamWriter` instance.

   All stream writers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   *stream* must be a file-like object open for writing binary data.

   The :class:`StreamWriter` may implement different error handling schemes by
   providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.

   * ``'backslashreplace'`` Replace with backslashed escape sequences.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamWriter` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: write(object)

      Writes the object's contents encoded to the stream.


   .. method:: writelines(list)

      Writes the concatenated list of strings to the stream (possibly by reusing
      the :meth:`write` method).


   .. method:: reset()

      Flushes and resets the codec buffers used for keeping state.

      Calling this method should ensure that the data on the output is put into
      a clean state that allows appending of new fresh data without having to
      rescan the whole stream to recover state.


In addition to the above methods, the :class:`StreamWriter` must also inherit
all other methods and attributes from the underlying stream.


.. _stream-reader-objects:

StreamReader Objects
^^^^^^^^^^^^^^^^^^^^

The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
following methods which every stream reader must define in order to be
compatible with the Python codec registry.


.. class:: StreamReader(stream[, errors])

   Constructor for a :class:`StreamReader` instance.

   All stream readers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   *stream* must be a file-like object open for reading (binary) data.

   The :class:`StreamReader` may implement different error handling schemes by
   providing the *errors* keyword argument. These parameters are defined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamReader` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: read([size[, chars, [firstline]]])

      Decodes data from the stream and returns the resulting object.

      *chars* indicates the number of characters to read from the
      stream. :meth:`read` will never return more than *chars* characters, but
      it might return fewer, if there are not enough characters available.

      *size* indicates the approximate maximum number of bytes to read from the
      stream for decoding purposes. The decoder can modify this setting as
      appropriate. The default value -1 indicates to read and decode as much as
      possible. *size* is intended to prevent having to decode huge files in
      one step.

      *firstline* indicates that it would be sufficient to only return the first
      line, if there are decoding errors on later lines.

      The method should use a greedy read strategy meaning that it should read
      as much data as is allowed within the definition of the encoding and the
      given size, e.g. if optional encoding endings or state markers are
      available on the stream, these should be read too.

      .. versionchanged:: 2.4
         *chars* argument added.

      .. versionchanged:: 2.4.2
         *firstline* argument added.


   .. method:: readline([size[, keepends]])

      Read one line from the input stream and return the decoded data.

      *size*, if given, is passed as size argument to the stream's
      :meth:`read` method.

      If *keepends* is false line-endings will be stripped from the lines
      returned.

      .. versionchanged:: 2.4
         *keepends* argument added.


   .. method:: readlines([sizehint[, keepends]])

      Read all lines available on the input stream and return them as a list of
      lines.

      Line-endings are implemented using the codec's decoder method and are
      included in the list entries if *keepends* is true.

      *sizehint*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.


   .. method:: reset()

      Resets the codec buffers used for keeping state.

      Note that no stream repositioning should take place. This method is
      primarily intended to be able to recover from decoding errors.


In addition to the above methods, the :class:`StreamReader` must also inherit
all other methods and attributes from the underlying stream.

The next two base classes are included for convenience. They are not needed by
the codec registry, but may prove useful in practice.


.. _stream-reader-writer:

StreamReaderWriter Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`StreamReaderWriter` allows wrapping streams which work in both read
and write modes.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamReaderWriter(stream, Reader, Writer, errors)

   Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
   object. *Reader* and *Writer* must be factory functions or classes providing the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively. Error
   handling is done in the same way as defined for the stream readers and writers.

:class:`StreamReaderWriter` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.


.. _stream-recoder-objects:

StreamRecoder Objects
^^^^^^^^^^^^^^^^^^^^^

The :class:`StreamRecoder` provides a frontend-backend view of encoding data
which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.
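
A minimal sketch of that construction, assuming a Latin-1 frontend and a UTF-8
backend (both encodings, the in-memory ``io.BytesIO`` stream and the variable
names are choices made for this example only): the frontend functions and the
backend stream classes are taken from the objects returned by :func:`lookup`.

```python
import codecs
import io

frontend = codecs.lookup("latin-1")  # what the calling code reads and writes
backend = codecs.lookup("utf-8")     # what is actually stored in the stream

raw = io.BytesIO()
recoder = codecs.StreamRecoder(
    raw,
    frontend.encode, frontend.decode,
    backend.streamreader, backend.streamwriter,
    "strict",
)

# Write Latin-1 bytes; they are decoded and stored as UTF-8.
recoder.write(b"caf\xe9")
stored = raw.getvalue()

# Rewind the underlying stream and read the data back as Latin-1 bytes.
raw.seek(0)
round_trip = recoder.read()
print(repr(stored), repr(round_trip))
```

Unicode serves as the intermediate format here: each direction decodes with one
codec and encodes with the other.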


.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors)

   Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
   *encode* and *decode* work on the frontend (the data seen by callers of
   :meth:`read` and :meth:`write`) while *Reader* and *Writer* work on the backend
   (reading and writing to the stream).

   You can use these objects to do transparent direct recodings from e.g. Latin-1
   to UTF-8 and back.

   *stream* must be a file-like object.

   *encode*, *decode* must adhere to the :class:`Codec` interface. *Reader*,
   *Writer* must be factory functions or classes providing objects of the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively.

   *encode* and *decode* are needed for the frontend translation, *Reader* and
   *Writer* for the backend translation. The intermediate format used is
   determined by the two sets of codecs, e.g. the Unicode codecs will use Unicode
   as the intermediate encoding.

   Error handling is done in the same way as defined for the stream readers and
   writers.


:class:`StreamRecoder` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.


.. _encodings-overview:

Encodings and Unicode
---------------------

Unicode strings are stored internally as sequences of code points (to be precise
as :c:type:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
via ``--enable-unicode=ucs2`` or ``--enable-unicode=ucs4``, with the
former being the default) :c:type:`Py_UNICODE` is either a 16-bit or 32-bit data
type. Once a Unicode object is used outside of CPU and memory, CPU endianness
and how these arrays are stored as bytes become an issue.
Transforming a
Unicode object into a sequence of bytes is called encoding and recreating the
Unicode object from the sequence of bytes is known as decoding. There are many
different methods for how this transformation can be done (these methods are
also called encodings). The simplest method is to map the code points 0--255 to
the bytes ``0x0``--``0xff``. This means that a Unicode object that contains
code points above ``U+00FF`` can't be encoded with this method (which is called
``'latin-1'`` or ``'iso-8859-1'``). :meth:`unicode.encode` will raise a
:exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: 'latin-1'
codec can't encode character u'\u1234' in position 3: ordinal not in
range(256)``.

There's another group of encodings (the so called charmap encodings) that choose
a different subset of all Unicode code points and how these code points are
mapped to the bytes ``0x0``--``0xff``. To see how this is done simply open
e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
Windows). There's a string constant with 256 characters that shows you which
character is mapped to which byte value.

All of these encodings can only encode 256 of the 1114112 code points
defined in Unicode. A simple and straightforward way that can store each Unicode
code point is to store each code point as four consecutive bytes. There are two
possibilities: store the bytes in big endian or in little endian order. These
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
problem: bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, then bytes have to be swapped though.
To
be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
there's the so called BOM ("Byte Order Mark"). This is the Unicode character
``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
byte sequence. The byte swapped version of this character (``0xFFFE``) is an
illegal character that may not appear in a Unicode text. So when the
first character in a ``UTF-16`` or ``UTF-32`` byte sequence
appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
Unfortunately the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
it's a device to determine the storage layout of the encoded bytes, and vanishes
once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH
NO-BREAK SPACE`` it's a normal character that will be decoded like any other.

There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: marker bits (the most significant bits) and payload bits. The marker bits
are a sequence of zero to four ``1`` bits followed by a ``0`` bit.
Unicode characters are
encoded like this (with x being payload bits, which when concatenated give the
Unicode character):

+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+

The least significant bit of the Unicode character is the rightmost x bit.

As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in
the decoded Unicode string (even if it's the first character) is treated as a
``ZERO WIDTH NO-BREAK SPACE``.

Without external information it's impossible to reliably determine which
encoding was used for encoding a Unicode string. Each charmap encoding can
decode any random byte sequence. However that's not possible with UTF-8, as
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
sequences. To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written.
As it's rather improbable
that any charmap encoded file starts with these byte values (which would e.g.
map to

   | LATIN SMALL LETTER I WITH DIAERESIS
   | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
   | INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
correctly guessed from the byte sequence. So here the BOM is not used to
determine the byte order used for generating the byte sequence, but as a
signature that helps in guessing the encoding. On encoding, the ``utf-8-sig`` codec
will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
decoding, ``utf-8-sig`` will skip those three bytes if they appear as the first
three bytes in the file. In UTF-8, the use of the BOM is discouraged and
should generally be avoided.


.. _standard-encodings:

Standard Encodings
------------------

Python comes with a number of codecs built-in, either implemented as C functions
or with dictionaries as mapping tables. The following table lists the codecs by
name, together with a few common aliases, and the languages for which the
encoding is likely used. Neither the list of aliases nor the list of languages
is meant to be exhaustive. Notice that spelling alternatives that only differ in
case or use a hyphen instead of an underscore are also valid aliases; therefore,
e.g. ``'utf-8'`` is a valid alias for the ``'utf_8'`` codec.

Many of the character sets support the same languages. They vary in individual
characters (e.g. whether the EURO SIGN is supported or not), and in the
assignment of characters to code positions.
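
The two points above (alias spelling, and character sets differing in which
characters they contain and where) can be sketched as follows. This is only an
illustration: the variable names are invented for the example, and the three
codecs were picked because their treatment of the EURO SIGN is well known.

```python
import codecs

# Spelling variants that differ only in case or hyphen/underscore
# resolve to the same codec (illustrative check via CodecInfo.name).
names = set(codecs.lookup(alias).name for alias in ("utf-8", "utf_8", "UTF-8"))

euro = u"\u20ac"  # EURO SIGN

# ISO 8859-15 and Windows-1252 both contain the EURO SIGN,
# but assign it to different code positions.
euro_iso8859_15 = euro.encode("iso8859_15")
euro_cp1252 = euro.encode("cp1252")

# ISO 8859-1 (latin_1) does not contain the EURO SIGN at all.
try:
    euro.encode("latin_1")
    euro_in_latin_1 = True
except UnicodeEncodeError:
    euro_in_latin_1 = False
```
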
For the European languages in
particular, the following variants typically exist:

* an ISO 8859 codeset

* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
  but replaces control characters with additional graphic characters

* an IBM EBCDIC code page

* an IBM PC code page, which is ASCII compatible

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+-----------------+--------------------------------+--------------------------------+
| Codec           | Aliases                        | Languages                      |
+=================+================================+================================+
| ascii           | 646, us-ascii                  | English                        |
+-----------------+--------------------------------+--------------------------------+
| big5            | big5-tw, csbig5                | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp037           | IBM037, IBM039                 | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp437           | 437, IBM437                    | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
|                 | IBM500                         |                                |
+-----------------+--------------------------------+--------------------------------+
| cp720           |                                | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp737           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp775           | IBM775                         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp850           | 850, IBM850                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp852           | 852, IBM852                    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp856           |                                | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp857           | 857, IBM857                    | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp858           | 858, IBM858                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp860           | 860, IBM860                    | Portuguese                     |
+-----------------+--------------------------------+--------------------------------+
| cp861           | 861, CP-IS, IBM861             | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| cp862           | 862, IBM862                    | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp863           | 863, IBM863                    | Canadian                       |
+-----------------+--------------------------------+--------------------------------+
| cp864           | IBM864                         | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp865           | 865, IBM865                    | Danish, Norwegian              |
+-----------------+--------------------------------+--------------------------------+
| cp866           | 866, IBM866                    | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| cp869           | 869, CP-GR, IBM869             | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp874           |                                | Thai                           |
+-----------------+--------------------------------+--------------------------------+
| cp875           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| cp949           | 949, ms949, uhc                | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| cp950           | 950, ms950                     | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp1006          |                                | Urdu                           |
+-----------------+--------------------------------+--------------------------------+
| cp1026          | ibm1026                        | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1140          | ibm1140                        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1250          | windows-1250                   | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp1252          | windows-1252                   | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1253          | windows-1253                   | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp1254          | windows-1254                   | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1255          | windows-1255                   | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp1256          | windows-1256                   | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp1257          | windows-1257                   | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp1258          | windows-1258                   | Vietnamese                     |
+-----------------+--------------------------------+--------------------------------+
| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jisx0213    | eucjisx0213                    | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_kr          | euckr, korean, ksc5601,        | Korean                         |
|                 | ks_c-5601, ks_c-5601-1987,     |                                |
|                 | ksx1001, ks_x-1001             |                                |
+-----------------+--------------------------------+--------------------------------+
| gb2312          | chinese, csiso58gb231280, euc- | Simplified Chinese             |
|                 | cn, euccn, eucgb2312-cn,       |                                |
|                 | gb2312-1980, gb2312-80, iso-   |                                |
|                 | ir-58                          |                                |
+-----------------+--------------------------------+--------------------------------+
| gbk             | 936, cp936, ms936              | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| gb18030         | gb18030-2000                   | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
|                 | iso-2022-jp                    |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
|                 |                                | Chinese, Western Europe, Greek |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
|                 | iso-2022-jp-2004               |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
|                 | iso-2022-kr                    |                                |
+-----------------+--------------------------------+--------------------------------+
| latin_1         | iso-8859-1, iso8859-1, 8859,   | West Europe                    |
|                 | cp819, latin, latin1, L1       |                                |
+-----------------+--------------------------------+--------------------------------+
| iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
+-----------------+--------------------------------+--------------------------------+
| iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| iso8859_6       | iso-8859-6, arabic             | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_11      | iso-8859-11, thai              | Thai languages                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_13      | iso-8859-13, latin7, L7        | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_15      | iso-8859-15, latin9, L9        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_16      | iso-8859-16, latin10, L10      | South-Eastern Europe           |
+-----------------+--------------------------------+--------------------------------+
| johab           | cp1361, ms1361                 | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| koi8_r          |                                | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| koi8_u          |                                | Ukrainian                      |
+-----------------+--------------------------------+--------------------------------+
| mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| mac_greek       | macgreek                       | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| mac_iceland     | maciceland                     | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| mac_latin2      | maclatin2, maccentraleurope    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| mac_roman       | macroman                       | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| mac_turkish     | macturkish                     | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
|                 | cyrillic-asian                 |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
|                 | s_jis                          |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
|                 | sjis2004                       |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
|                 | s_jisx0213                     |                                |
+-----------------+--------------------------------+--------------------------------+
| utf_32          | U32, utf32                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_be       | UTF-32BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_le       | UTF-32LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16          | U16, utf16                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_be       | UTF-16BE                       | all languages (BMP only)       |
+-----------------+--------------------------------+--------------------------------+
| utf_16_le       | UTF-16LE                       | all languages (BMP only)       |
+-----------------+--------------------------------+--------------------------------+
| utf_7           | U7, unicode-1-1-utf-7          | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8           | U8, UTF, utf8                  | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8_sig       |                                | all languages                  |
+-----------------+--------------------------------+--------------------------------+

Python Specific Encodings
-------------------------

A number of predefined codecs are specific to Python, so their codec names have
no meaning outside Python. These are listed in the tables below based on the
expected input and output types (note that while text encodings are the most
common use case for codecs, the underlying codec infrastructure supports
arbitrary data transforms rather than just text encodings). For asymmetric
codecs, the stated purpose describes the encoding direction.

The following codecs provide unicode-to-str encoding [#encoding-note]_ and
str-to-unicode decoding [#decoding-note]_, similar to the Unicode text
encodings.

.. tabularcolumns:: |l|L|L|

+--------------------+---------------------------+---------------------------+
| Codec              | Aliases                   | Purpose                   |
+====================+===========================+===========================+
| idna               |                           | Implements :rfc:`3490`,   |
|                    |                           | see also                  |
|                    |                           | :mod:`encodings.idna`     |
+--------------------+---------------------------+---------------------------+
| mbcs               | dbcs                      | Windows only: Encode      |
|                    |                           | operand according to the  |
|                    |                           | ANSI codepage (CP_ACP)    |
+--------------------+---------------------------+---------------------------+
| palmos             |                           | Encoding of PalmOS 3.5    |
+--------------------+---------------------------+---------------------------+
| punycode           |                           | Implements :rfc:`3492`    |
+--------------------+---------------------------+---------------------------+
| raw_unicode_escape |                           | Produce a string that is  |
|                    |                           | suitable as raw Unicode   |
|                    |                           | literal in Python source  |
|                    |                           | code                      |
+--------------------+---------------------------+---------------------------+
| rot_13             | rot13                     | Returns the Caesar-cypher |
|                    |                           | encryption of the operand |
+--------------------+---------------------------+---------------------------+
| undefined          |                           | Raise an exception for    |
|                    |                           | all conversions. Can be   |
|                    |                           | used as the system        |
|                    |                           | encoding if no automatic  |
|                    |                           | :term:`coercion` between  |
|                    |                           | byte and Unicode strings  |
|                    |                           | is desired.               |
+--------------------+---------------------------+---------------------------+
| unicode_escape     |                           | Produce a string that is  |
|                    |                           | suitable as Unicode       |
|                    |                           | literal in Python source  |
|                    |                           | code                      |
+--------------------+---------------------------+---------------------------+
| unicode_internal   |                           | Return the internal       |
|                    |                           | representation of the     |
|                    |                           | operand                   |
+--------------------+---------------------------+---------------------------+

.. versionadded:: 2.3
   The ``idna`` and ``punycode`` encodings.

The following codecs provide str-to-str encoding and decoding
[#decoding-note]_.

.. tabularcolumns:: |l|L|L|L|

+--------------------+---------------------------+---------------------------+------------------------------+
| Codec              | Aliases                   | Purpose                   | Encoder/decoder              |
+====================+===========================+===========================+==============================+
| base64_codec       | base64, base-64           | Convert operand to        | :meth:`base64.encodestring`, |
|                    |                           | multiline MIME base64 (the| :meth:`base64.decodestring`  |
|                    |                           | result always includes a  |                              |
|                    |                           | trailing ``'\n'``)        |                              |
+--------------------+---------------------------+---------------------------+------------------------------+
| bz2_codec          | bz2                       | Compress the operand      | :meth:`bz2.compress`,        |
|                    |                           | using bz2                 | :meth:`bz2.decompress`       |
+--------------------+---------------------------+---------------------------+------------------------------+
| hex_codec          | hex                       | Convert operand to        | :meth:`binascii.b2a_hex`,    |
|                    |                           | hexadecimal               | :meth:`binascii.a2b_hex`     |
|                    |                           | representation, with two  |                              |
|                    |                           | digits per byte           |                              |
+--------------------+---------------------------+---------------------------+------------------------------+
| quopri_codec       | quopri, quoted-printable, | Convert operand to MIME   | :meth:`quopri.encode` with   |
|                    | quotedprintable           | quoted printable          | ``quotetabs=True``,          |
|                    |                           |                           | :meth:`quopri.decode`        |
+--------------------+---------------------------+---------------------------+------------------------------+
| string_escape      |                           | Produce a string that is  |                              |
|                    |                           | suitable as string        |                              |
|                    |                           | literal in Python source  |                              |
|                    |                           | code                      |                              |
+--------------------+---------------------------+---------------------------+------------------------------+
| uu_codec           | uu                        | Convert the operand using | :meth:`uu.encode`,           |
|                    |                           | uuencode                  | :meth:`uu.decode`            |
+--------------------+---------------------------+---------------------------+------------------------------+
| zlib_codec         | zip, zlib                 | Compress the operand      | :meth:`zlib.compress`,       |
|                    |                           | using gzip                | :meth:`zlib.decompress`      |
+--------------------+---------------------------+---------------------------+------------------------------+

.. [#encoding-note] str objects are also accepted as input in place of unicode
   objects. They are implicitly converted to unicode by decoding them using
   the default encoding. If this conversion fails, it may lead to encoding
   operations raising :exc:`UnicodeDecodeError`.

.. [#decoding-note] unicode objects are also accepted as input in place of str
   objects. They are implicitly converted to str by encoding them using the
   default encoding. If this conversion fails, it may lead to decoding
   operations raising :exc:`UnicodeEncodeError`.


:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis

.. versionadded:: 2.3

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.

These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``).
The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application and, if possible, is
invisible to the user: the application should transparently convert Unicode
domain labels to IDNA on the wire, and convert ACE labels back to Unicode before
presenting them to the user.

Python supports this conversion in several ways: the ``idna`` codec performs
conversion between Unicode and ACE, separating an input string into labels
based on the separator characters defined in `section 3.1`_ (1) of :rfc:`3490`
and converting each label to ACE as required, and conversely separating an input
byte string into labels based on the ``.`` separator and converting any ACE
labels found into unicode. Furthermore, the :mod:`socket` module
transparently converts Unicode host names to ACE, so that applications need not
be concerned about converting host names themselves when they pass them to the
socket module. On top of that, modules that have host names as function
parameters, such as :mod:`httplib` and :mod:`ftplib`, accept Unicode host names
(:mod:`httplib` then also transparently sends an IDNA hostname in the
:mailheader:`Host` field if it sends that field at all).

.. _section 3.1: https://tools.ietf.org/html/rfc3490#section-3.1

When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: Applications wishing to present
such host names to the user should decode them to Unicode.

The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters.
The nameprep
functions can be used directly if desired.


.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.


:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald

.. versionadded:: 2.5

This module implements a variant of the UTF-8 codec: On encoding a UTF-8 encoded
BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this
is only done once (on the first write to the byte stream). For decoding an
optional UTF-8 encoded BOM at the start of the data will be skipped.
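
The behaviour described above can be sketched roughly as follows. The sample
text and the variable names are invented for illustration, and the ``u""`` and
``b""`` prefixes are used only so that the snippet runs on both Python 2.6+ and
Python 3.3+.

```python
import codecs

text = u"spam"  # hypothetical sample text

# Encoding prepends the three-byte UTF-8 encoded BOM (codecs.BOM_UTF8).
with_sig = text.encode("utf-8-sig")

# Decoding skips the signature when present; it is optional, so data
# without a signature decodes to the same result.
from_sig = with_sig.decode("utf-8-sig")
from_plain = b"spam".decode("utf-8-sig")

# The plain utf-8 codec keeps the signature as a leading U+FEFF character.
kept = with_sig.decode("utf-8")

# A stateful (incremental) encoder emits the signature only on the first call.
encoder = codecs.getincrementalencoder("utf-8-sig")()
chunks = [encoder.encode(u"sp"), encoder.encode(u"am", True)]
```

Here only the first element of ``chunks`` starts with the signature, which
mirrors the "only done once" rule for the stateful encoder.
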