.. highlight:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Georg Brandl <georg@python.org>

Unicode Objects
^^^^^^^^^^^^^^^

Since the implementation of :pep:`393` in Python 3.3, Unicode objects internally
use a variety of representations, in order to allow handling the complete range
of Unicode characters while staying memory efficient. There are special cases
for strings where all code points are below 128, 256, or 65536; otherwise, code
points must be below 1114112 (which is the full Unicode range).

:c:type:`Py_UNICODE*` and UTF-8 representations are created on demand and cached
in the Unicode object. The :c:type:`Py_UNICODE*` representation is deprecated
and inefficient; it should be avoided in performance- or memory-sensitive
situations.

Due to the transition between the old APIs and the new APIs, Unicode objects
can internally be in two states depending on how they were created:

* "canonical" Unicode objects are all objects created by a non-deprecated
  Unicode API. They use the most efficient representation allowed by the
  implementation.

* "legacy" Unicode objects have been created through one of the deprecated
  APIs (typically :c:func:`PyUnicode_FromUnicode`) and only bear the
  :c:type:`Py_UNICODE*` representation; you will have to call
  :c:func:`PyUnicode_READY` on them before calling any other API.


Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:

.. c:type:: Py_UCS4
            Py_UCS2
            Py_UCS1

   These types are typedefs for unsigned integer types wide enough to contain
   characters of 32 bits, 16 bits and 8 bits, respectively. When dealing with
   single Unicode characters, use :c:type:`Py_UCS4`.

   .. versionadded:: 3.3


.. c:type:: Py_UNICODE

   This is a typedef of :c:type:`wchar_t`, which is a 16-bit type or 32-bit type
   depending on the platform.

   .. versionchanged:: 3.3
      In previous versions, this was a 16-bit type or a 32-bit type depending on
      whether you selected a "narrow" or "wide" Unicode version of Python at
      build time.


.. c:type:: PyASCIIObject
            PyCompactUnicodeObject
            PyUnicodeObject

   These subtypes of :c:type:`PyObject` represent a Python Unicode object. In
   almost all cases, they shouldn't be used directly, since all API functions
   that deal with Unicode objects take and return :c:type:`PyObject` pointers.

   .. versionadded:: 3.3


.. c:var:: PyTypeObject PyUnicode_Type

   This instance of :c:type:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``str``.


The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:

.. c:function:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.


.. c:function:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.


.. c:function:: int PyUnicode_READY(PyObject *o)

   Ensure the string object *o* is in the "canonical" representation. This is
   required before using any of the access macros described below.

   .. XXX expand on when it is not required

   Returns ``0`` on success and ``-1`` with an exception set on failure, which in
   particular happens if memory allocation fails.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_GET_LENGTH(PyObject *o)

   Return the length of the Unicode string, in code points. *o* has to be a
   Unicode object in the "canonical" representation (not checked).

   .. versionadded:: 3.3


.. c:function:: Py_UCS1* PyUnicode_1BYTE_DATA(PyObject *o)
                Py_UCS2* PyUnicode_2BYTE_DATA(PyObject *o)
                Py_UCS4* PyUnicode_4BYTE_DATA(PyObject *o)

   Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4
   integer types for direct character access. No checks are performed to verify
   that the canonical representation has the correct character size; use
   :c:func:`PyUnicode_KIND` to select the right macro. Make sure
   :c:func:`PyUnicode_READY` has been called before accessing this.

   .. versionadded:: 3.3


.. c:macro:: PyUnicode_WCHAR_KIND
             PyUnicode_1BYTE_KIND
             PyUnicode_2BYTE_KIND
             PyUnicode_4BYTE_KIND

   Return values of the :c:func:`PyUnicode_KIND` macro.

   .. versionadded:: 3.3


.. c:function:: int PyUnicode_KIND(PyObject *o)

   Return one of the PyUnicode kind constants (see above) that indicate how many
   bytes per character this Unicode object uses to store its data. *o* has to
   be a Unicode object in the "canonical" representation (not checked).

   .. XXX document "0" return value?

   .. versionadded:: 3.3


.. c:function:: void* PyUnicode_DATA(PyObject *o)

   Return a void pointer to the raw Unicode buffer. *o* has to be a Unicode
   object in the "canonical" representation (not checked).

   .. versionadded:: 3.3


.. c:function:: void PyUnicode_WRITE(int kind, void *data, Py_ssize_t index, \
                                     Py_UCS4 value)

   Write into a canonical representation *data* (as obtained with
   :c:func:`PyUnicode_DATA`). This macro does not do any sanity checks and is
   intended for usage in loops. The caller should cache the *kind* value and
   *data* pointer as obtained from other macro calls. *index* is the index in
   the string (starts at 0) and *value* is the new code point value which should
   be written to that location.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_READ(int kind, void *data, Py_ssize_t index)

   Read a code point from a canonical representation *data* (as obtained with
   :c:func:`PyUnicode_DATA`). No checks or ready calls are performed.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_READ_CHAR(PyObject *o, Py_ssize_t index)

   Read a character from a Unicode object *o*, which must be in the "canonical"
   representation. This is less efficient than :c:func:`PyUnicode_READ` if you
   do multiple consecutive reads.

   .. versionadded:: 3.3


.. c:function:: PyUnicode_MAX_CHAR_VALUE(PyObject *o)

   Return the maximum code point that is suitable for creating another string
   based on *o*, which must be in the "canonical" representation. This is
   always an approximation but more efficient than iterating over the string.

   .. versionadded:: 3.3


.. c:function:: int PyUnicode_ClearFreeList()

   Clear the free list. Return the total number of freed items.


.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
   code units (this includes surrogate pairs as 2 units). *o* has to be a
   Unicode object (not checked).

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation in
   bytes. *o* has to be a Unicode object (not checked).

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
                const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to a :c:type:`Py_UNICODE` representation of the object.
   The returned buffer is always terminated with an extra null code point. It
   may also contain embedded null code points, which would cause the string
   to be truncated when used in most C functions. The ``AS_DATA`` form
   casts the pointer to :c:type:`const char *`. The *o* argument has to be
   a Unicode object (not checked).

   .. versionchanged:: 3.3
      This macro is now inefficient -- because in many cases the
      :c:type:`Py_UNICODE` representation does not exist and needs to be created
      -- and can fail (return ``NULL`` with an exception set). Try to port the
      code to use the new :c:func:`PyUnicode_nBYTE_DATA` macros or use
      :c:func:`PyUnicode_WRITE` or :c:func:`PyUnicode_READ`.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style Unicode API, please migrate to using the
      :c:func:`PyUnicode_nBYTE_DATA` family of macros.


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. c:function:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a whitespace character.


.. c:function:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a lowercase character.


.. c:function:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is an uppercase character.


.. c:function:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a titlecase character.


.. c:function:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a linebreak character.


.. c:function:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a decimal character.


.. c:function:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a digit character.


.. c:function:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a numeric character.


.. c:function:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphabetic character.


.. c:function:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphanumeric character.


.. c:function:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a printable character.
   Nonprintable characters are those characters defined in the Unicode character
   database as "Other" or "Separator", excepting the ASCII space (0x20) which is
   considered printable. (Note that printable characters in this context are
   those which should not be escaped when :func:`repr` is invoked on a string.
   It has no bearing on the handling of strings written to :data:`sys.stdout` or
   :data:`sys.stderr`.)


These APIs can be used for fast direct character conversions:


.. c:function:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. c:function:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. c:function:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.


These APIs can be used to work with surrogates:

.. c:macro:: Py_UNICODE_IS_SURROGATE(ch)

   Check if *ch* is a surrogate (``0xD800 <= ch <= 0xDFFF``).

.. c:macro:: Py_UNICODE_IS_HIGH_SURROGATE(ch)

   Check if *ch* is a high surrogate (``0xD800 <= ch <= 0xDBFF``).

.. c:macro:: Py_UNICODE_IS_LOW_SURROGATE(ch)

   Check if *ch* is a low surrogate (``0xDC00 <= ch <= 0xDFFF``).

.. c:macro:: Py_UNICODE_JOIN_SURROGATES(high, low)

   Join two surrogate characters and return a single Py_UCS4 value.
   *high* and *low* are respectively the leading and trailing surrogates in a
   surrogate pair.


Creating and accessing Unicode strings
""""""""""""""""""""""""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:

.. c:function:: PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)

   Create a new Unicode object. *maxchar* should be the true maximum code point
   to be placed in the string. As an approximation, it can be rounded up to the
   nearest value in the sequence 127, 255, 65535, 1114111.

   This is the recommended way to allocate a new Unicode object. Objects
   created using this function are not resizable.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_FromKindAndData(int kind, const void *buffer, \
                                                    Py_ssize_t size)

   Create a new Unicode object with the given *kind* (possible values are
   :c:macro:`PyUnicode_1BYTE_KIND` etc., as returned by
   :c:func:`PyUnicode_KIND`). The *buffer* must point to an array of *size*
   units of 1, 2 or 4 bytes per character, as given by the kind.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*. The bytes will be
   interpreted as being UTF-8 encoded. The buffer is copied into the new
   object. If the buffer is not ``NULL``, the return value might be a shared
   object, i.e. modification of the data is not allowed.

   If *u* is ``NULL``, this function behaves like :c:func:`PyUnicode_FromUnicode`
   with the buffer set to ``NULL``. This usage is deprecated in favor of
   :c:func:`PyUnicode_New`.


.. c:function:: PyObject *PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.


.. c:function:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :c:func:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python Unicode string and return
   a string with the values formatted into it. The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   ASCII-encoded string. The following format characters are allowed:

   .. % This should be exactly the same as the table in PyErr_Format.
   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.
   .. % Similar comments apply to the %ll width modifier and

   .. tabularcolumns:: |l|l|L|

   +-------------------+---------------------+----------------------------------+
   | Format Characters | Type                | Comment                          |
   +===================+=====================+==================================+
   | :attr:`%%`        | *n/a*               | The literal % character.         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%c`        | int                 | A single character,              |
   |                   |                     | represented as a C int.          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%d`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%d")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%u`        | unsigned int        | Equivalent to                    |
   |                   |                     | ``printf("%u")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%ld`       | long                | Equivalent to                    |
   |                   |                     | ``printf("%ld")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%li`       | long                | Equivalent to                    |
   |                   |                     | ``printf("%li")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lu`       | unsigned long       | Equivalent to                    |
   |                   |                     | ``printf("%lu")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lld`      | long long           | Equivalent to                    |
   |                   |                     | ``printf("%lld")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lli`      | long long           | Equivalent to                    |
   |                   |                     | ``printf("%lli")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%llu`      | unsigned long long  | Equivalent to                    |
   |                   |                     | ``printf("%llu")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Equivalent to                    |
   |                   |                     | ``printf("%zd")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zi`       | Py_ssize_t          | Equivalent to                    |
   |                   |                     | ``printf("%zi")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zu`       | size_t              | Equivalent to                    |
   |                   |                     | ``printf("%zu")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%i`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%i")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%x`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%x")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%s`        | const char\*        | A null-terminated C character    |
   |                   |                     | array.                           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%p`        | const void\*        | The hex representation of a C    |
   |                   |                     | pointer. Mostly equivalent to    |
   |                   |                     | ``printf("%p")`` except that     |
   |                   |                     | it is guaranteed to start with   |
   |                   |                     | the literal ``0x`` regardless    |
   |                   |                     | of what the platform's           |
   |                   |                     | ``printf`` yields.               |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%A`        | PyObject\*          | The result of calling            |
   |                   |                     | :func:`ascii`.                   |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%U`        | PyObject\*          | A Unicode object.                |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%V`        | PyObject\*,         | A Unicode object (which may be   |
   |                   | const char\*        | ``NULL``) and a null-terminated  |
   |                   |                     | C character array as a second    |
   |                   |                     | parameter (which will be used,   |
   |                   |                     | if the first parameter is        |
   |                   |                     | ``NULL``).                       |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling            |
   |                   |                     | :c:func:`PyObject_Str`.          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling            |
   |                   |                     | :c:func:`PyObject_Repr`.         |
   +-------------------+---------------------+----------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.

   .. note::
      The width formatter unit is number of characters rather than bytes.
      The precision formatter unit is number of bytes for ``"%s"`` and
      ``"%V"`` (if the ``PyObject*`` argument is ``NULL``), and a number of
      characters for ``"%A"``, ``"%U"``, ``"%S"``, ``"%R"`` and ``"%V"``
      (if the ``PyObject*`` argument is not ``NULL``).

   .. [1] For integer specifiers (d, u, ld, li, lu, lld, lli, llu, zd, zi,
      zu, i, x): the 0-conversion flag has effect even when a precision is given.

   .. versionchanged:: 3.2
      Support for ``"%lld"`` and ``"%llu"`` added.

   .. versionchanged:: 3.3
      Support for ``"%li"``, ``"%lli"`` and ``"%zi"`` added.

   .. versionchanged:: 3.4
      Support width and precision formatter for ``"%s"``, ``"%A"``, ``"%U"``,
      ``"%V"``, ``"%S"``, ``"%R"`` added.


.. c:function:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :c:func:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.


.. c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, \
                               const char *encoding, const char *errors)

   Decode an encoded object *obj* to a Unicode object.

   :class:`bytes`, :class:`bytearray` and other
   :term:`bytes-like objects <bytes-like object>`
   are decoded according to the given *encoding* and using the error handling
   defined by *errors*. Both can be ``NULL`` to have the interface use the default
   values (see :ref:`builtincodecs` for details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns ``NULL`` if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. c:function:: Py_ssize_t PyUnicode_GetLength(PyObject *unicode)

   Return the length of the Unicode object, in code points.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_CopyCharacters(PyObject *to, \
                                                    Py_ssize_t to_start, \
                                                    PyObject *from, \
                                                    Py_ssize_t from_start, \
                                                    Py_ssize_t how_many)

   Copy characters from one Unicode object into another. This function performs
   character conversion when necessary and falls back to :c:func:`memcpy` if
   possible. Returns ``-1`` and sets an exception on error, otherwise returns
   the number of copied characters.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_Fill(PyObject *unicode, Py_ssize_t start, \
                        Py_ssize_t length, Py_UCS4 fill_char)

   Fill a string with a character: write *fill_char* into
   ``unicode[start:start+length]``.

   Fail if *fill_char* is bigger than the string maximum character, or if the
   string has more than 1 reference.

   Return the number of written characters, or return ``-1`` and raise an
   exception on error.

   .. versionadded:: 3.3


.. c:function:: int PyUnicode_WriteChar(PyObject *unicode, Py_ssize_t index, \
                                        Py_UCS4 character)

   Write a character to a string. The string must have been created through
   :c:func:`PyUnicode_New`.
   Since Unicode strings are supposed to be immutable,
   the string must not be shared, or have been hashed yet.

   This function checks that *unicode* is a Unicode object, that the index is
   not out of bounds, and that the object can be modified safely (i.e. that
   its reference count is one).

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_ReadChar(PyObject *unicode, Py_ssize_t index)

   Read a character from a string. This function checks that *unicode* is a
   Unicode object and the index is not out of bounds, in contrast to the macro
   version :c:func:`PyUnicode_READ_CHAR`.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_Substring(PyObject *str, Py_ssize_t start, \
                                              Py_ssize_t end)

   Return a substring of *str*, from character index *start* (included) to
   character index *end* (excluded). Negative indices are not supported.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4* PyUnicode_AsUCS4(PyObject *u, Py_UCS4 *buffer, \
                                          Py_ssize_t buflen, int copy_null)

   Copy the string *u* into a UCS4 buffer, including a null character, if
   *copy_null* is set. Returns ``NULL`` and sets an exception on error (in
   particular, a :exc:`SystemError` if *buflen* is smaller than the length of
   *u*). *buffer* is returned on success.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4* PyUnicode_AsUCS4Copy(PyObject *u)

   Copy the string *u* into a new UCS4 buffer that is allocated using
   :c:func:`PyMem_Malloc`. If this fails, ``NULL`` is returned with a
   :exc:`MemoryError` set. The returned buffer always has an extra
   null code point appended.

   .. versionadded:: 3.3


Deprecated Py_UNICODE APIs
""""""""""""""""""""""""""

.. deprecated-removed:: 3.3 4.0

These API functions are deprecated with the implementation of :pep:`393`.
Extension modules can continue using them, as they will not be removed in Python
3.x, but need to be aware that their use can now cause performance and memory hits.


.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
   may be ``NULL`` which causes the contents to be undefined. It is the user's
   responsibility to fill in the needed data. The buffer is copied into the new
   object.

   If the buffer is not ``NULL``, the return value might be a shared object.
   Therefore, modification of the resulting Unicode object is only allowed when
   *u* is ``NULL``.

   If the buffer is ``NULL``, :c:func:`PyUnicode_READY` must be called once the
   string content has been filled before using any of the access macros such as
   :c:func:`PyUnicode_KIND`.

   Please migrate to using :c:func:`PyUnicode_FromKindAndData`,
   :c:func:`PyUnicode_FromWideChar` or :c:func:`PyUnicode_New`.


.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal
   :c:type:`Py_UNICODE` buffer, or ``NULL`` on error. This will create the
   :c:type:`Py_UNICODE*` representation of the object if it is not yet
   available. The buffer is always terminated with an extra null code point.
   Note that the resulting :c:type:`Py_UNICODE` string may also contain
   embedded null code points, which would cause the string to be truncated when
   used in most C functions.

   Please migrate to using :c:func:`PyUnicode_AsUCS4`,
   :c:func:`PyUnicode_AsWideChar`, :c:func:`PyUnicode_ReadChar` or similar new
   APIs.

   .. deprecated-removed:: 3.3 3.10


.. c:function:: PyObject* PyUnicode_TransformDecimalToASCII(Py_UNICODE *s, Py_ssize_t size)

   Create a Unicode object by replacing all decimal digits in
   :c:type:`Py_UNICODE` buffer of the given *size* by ASCII digits 0--9
   according to their decimal value. Return ``NULL`` if an exception occurs.


.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size)

   Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:type:`Py_UNICODE`
   array length (excluding the extra null terminator) in *size*.
   Note that the resulting :c:type:`Py_UNICODE*` string
   may contain embedded null code points, which would cause the string to be
   truncated when used in most C functions.

   .. versionadded:: 3.3


.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeCopy(PyObject *unicode)

   Create a copy of a Unicode string ending with a null code point. Return ``NULL``
   and raise a :exc:`MemoryError` exception on memory allocation failure,
   otherwise return a newly allocated buffer (use :c:func:`PyMem_Free` to free
   the buffer). Note that the resulting :c:type:`Py_UNICODE*` string may
   contain embedded null code points, which would cause the string to be
   truncated when used in most C functions.

   .. versionadded:: 3.2

   Please migrate to using :c:func:`PyUnicode_AsUCS4Copy` or similar new APIs.


.. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
   code units (this includes surrogate pairs as 2 units).

   Please migrate to using :c:func:`PyUnicode_GetLength`.


.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Copy an instance of a Unicode subtype to a new true Unicode object if
   necessary. If *obj* is already a true Unicode object (not a subtype),
   return the reference with incremented refcount.

   Objects other than Unicode or its subtypes will cause a :exc:`TypeError`.


Locale Encoding
"""""""""""""""

The current locale encoding can be used to decode text from the operating
system.

.. c:function:: PyObject* PyUnicode_DecodeLocaleAndSize(const char *str, \
                                                        Py_ssize_t len, \
                                                        const char *errors)

   Decode a string from UTF-8 on Android and VxWorks, or from the current
   locale encoding on other platforms. The supported
   error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The decoder uses ``"strict"`` error handler if
   *errors* is ``NULL``. *str* must end with a null character but
   cannot contain embedded null characters.

   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` to decode a string from
   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
   Python startup).

   This function ignores the Python UTF-8 mode.

   .. seealso::

      The :c:func:`Py_DecodeLocale` function.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously, :c:func:`Py_DecodeLocale`
      was used for the ``surrogateescape``, and the current locale encoding was
      used for ``strict``.


.. c:function:: PyObject* PyUnicode_DecodeLocale(const char *str, const char *errors)

   Similar to :c:func:`PyUnicode_DecodeLocaleAndSize`, but compute the string
   length using :c:func:`strlen`.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_EncodeLocale(PyObject *unicode, const char *errors)

   Encode a Unicode object to UTF-8 on Android and VxWorks, or to the current
   locale encoding on other platforms. The
   supported error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The encoder uses ``"strict"`` error handler if
   *errors* is ``NULL``.
   Return a :class:`bytes` object. *unicode* cannot
   contain embedded null characters.

   Use :c:func:`PyUnicode_EncodeFSDefault` to encode a string to
   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
   Python startup).

   This function ignores the Python UTF-8 mode.

   .. seealso::

      The :c:func:`Py_EncodeLocale` function.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously,
      :c:func:`Py_EncodeLocale` was used for the ``surrogateescape`` error
      handler, and the current locale encoding was used for ``strict``.


File System Encoding
""""""""""""""""""""

To encode and decode file names and other environment strings,
:c:data:`Py_FileSystemDefaultEncoding` should be used as the encoding, and
:c:data:`Py_FileSystemDefaultEncodeErrors` should be used as the error handler
(:pep:`383` and :pep:`529`). To encode file names to :class:`bytes` during
argument parsing, the ``"O&"`` converter should be used, passing
:c:func:`PyUnicode_FSConverter` as the conversion function:

.. c:function:: int PyUnicode_FSConverter(PyObject* obj, void* result)

   ParseTuple converter: encode :class:`str` objects -- obtained directly or
   through the :class:`os.PathLike` interface -- to :class:`bytes` using
   :c:func:`PyUnicode_EncodeFSDefault`; :class:`bytes` objects are output as-is.
   *result* must be a :c:type:`PyBytesObject*` which must be released when it is
   no longer used.

   .. versionadded:: 3.1

   .. versionchanged:: 3.6
      Accepts a :term:`path-like object`.

To decode file names to :class:`str` during argument parsing, the ``"O&"``
converter should be used, passing :c:func:`PyUnicode_FSDecoder` as the
conversion function:

.. c:function:: int PyUnicode_FSDecoder(PyObject* obj, void* result)

   ParseTuple converter: decode :class:`bytes` objects -- obtained either
   directly or indirectly through the :class:`os.PathLike` interface -- to
   :class:`str` using :c:func:`PyUnicode_DecodeFSDefaultAndSize`; :class:`str`
   objects are output as-is. *result* must be a :c:type:`PyUnicodeObject*` which
   must be released when it is no longer used.

   .. versionadded:: 3.2

   .. versionchanged:: 3.6
      Accepts a :term:`path-like object`.


.. c:function:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)

   Decode a string using :c:data:`Py_FileSystemDefaultEncoding` and the
   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.

   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
   locale encoding.

   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
   locale encoding and cannot be modified later. If you need to decode a string
   from the current locale encoding, use
   :c:func:`PyUnicode_DecodeLocaleAndSize`.

   .. seealso::

      The :c:func:`Py_DecodeLocale` function.

   .. versionchanged:: 3.6
      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.


.. c:function:: PyObject* PyUnicode_DecodeFSDefault(const char *s)

   Decode a null-terminated string using :c:data:`Py_FileSystemDefaultEncoding`
   and the :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.

   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
   locale encoding.

   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length.

   .. versionchanged:: 3.6
      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.


.. c:function:: PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode)

   Encode a Unicode object to :c:data:`Py_FileSystemDefaultEncoding` with the
   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler, and return
   :class:`bytes`. Note that the resulting :class:`bytes` object may contain
   null bytes.

   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
   locale encoding.

   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
   locale encoding and cannot be modified later. If you need to encode a string
   to the current locale encoding, use :c:func:`PyUnicode_EncodeLocale`.

   .. seealso::

      The :c:func:`Py_EncodeLocale` function.

   .. versionadded:: 3.2

   .. versionchanged:: 3.6
      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.

wchar_t Support
"""""""""""""""

:c:type:`wchar_t` support for platforms which support it:

.. c:function:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :c:type:`wchar_t` buffer *w* of the given *size*.
   Passing ``-1`` as the *size* indicates that the function must itself compute
   the length, using ``wcslen``.
   Return ``NULL`` on failure.


.. c:function:: Py_ssize_t PyUnicode_AsWideChar(PyObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*. At most
   *size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing
   null termination character). Return the number of :c:type:`wchar_t` characters
   copied or ``-1`` in case of an error. Note that the resulting :c:type:`wchar_t*`
   string may or may not be null-terminated. It is the responsibility of the caller
   to make sure that the :c:type:`wchar_t*` string is null-terminated in case this is
   required by the application.
   Also, note that the :c:type:`wchar_t*` string
   might contain null characters, which would cause the string to be truncated
   when used with most C functions.


.. c:function:: wchar_t* PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)

   Convert the Unicode object to a wide character string. The output string
   always ends with a null character. If *size* is not ``NULL``, write the number
   of wide characters (excluding the trailing null termination character) into
   *\*size*. Note that the resulting :c:type:`wchar_t` string might contain
   null characters, which would cause the string to be truncated when used with
   most C functions. If *size* is ``NULL`` and the :c:type:`wchar_t*` string
   contains null characters, a :exc:`ValueError` is raised.

   Returns a buffer allocated by :c:func:`PyMem_New` (use
   :c:func:`PyMem_Free` to free it) on success. On error, returns ``NULL``
   and *\*size* is undefined. Raises a :exc:`MemoryError` if memory allocation
   fails.

   .. versionadded:: 3.2

   .. versionchanged:: 3.7
      Raises a :exc:`ValueError` if *size* is ``NULL`` and the :c:type:`wchar_t*`
      string contains null characters.


.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*, and they
have the same semantics as the ones of the built-in :func:`str` string object
constructor.

Setting *encoding* to ``NULL`` causes the default encoding to be used,
which is UTF-8. The file system calls should use
:c:func:`PyUnicode_FSConverter` for encoding file names. This uses the
variable :c:data:`Py_FileSystemDefaultEncoding` internally.
This variable should be treated as read-only: on some systems, it will be a
pointer to a static string, on others, it will change at run-time
(such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to ``NULL``, meaning to use
the default handling defined for the codec. Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. For simplicity, only deviations from
the following generic ones are documented.


Generic Codecs
""""""""""""""

These are the generic codec APIs:


.. c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, \
                              const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`str` built-in function. The codec to be used is looked up
   using the Python codec registry. Return ``NULL`` if an exception was raised by
   the codec.


.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, \
                              const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python bytes object.
   *encoding* and *errors* have the same meaning as the parameters of the same
   name in the Unicode :meth:`~str.encode` method. The codec to be used is looked up
   using the Python codec registry. Return ``NULL`` if an exception was raised by
   the codec.


.. c:function:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, \
                              const char *encoding, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* and return a Python
   bytes object.
   *encoding* and *errors* have the same meaning as the
   parameters of the same name in the Unicode :meth:`~str.encode` method. The codec
   to be used is looked up using the Python codec registry. Return ``NULL`` if an
   exception was raised by the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsEncodedString`.


UTF-8 Codecs
""""""""""""

These are the UTF-8 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF8`. If
   *consumed* is not ``NULL``, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as a Python bytes
   object. Error handling is "strict". Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: const char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)

   Return a pointer to the UTF-8 encoding of the Unicode object, and
   store the size of the encoded representation (in bytes) in *size*. The
   *size* argument can be ``NULL``; in this case no size will be stored. The
   returned buffer always has an extra null byte appended (not included in
   *size*), regardless of whether there are any other null code points.

   In the case of an error, ``NULL`` is returned with an exception set and no
   *size* is stored.

   This caches the UTF-8 representation of the string in the Unicode object, and
   subsequent calls will return a pointer to the same buffer. The caller is not
   responsible for deallocating the buffer.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.


.. c:function:: const char* PyUnicode_AsUTF8(PyObject *unicode)

   As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.


.. c:function:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-8 and
   return a Python bytes object. Return ``NULL`` if an exception was raised by
   the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUTF8String`, :c:func:`PyUnicode_AsUTF8AndSize` or
      :c:func:`PyUnicode_AsEncodedString`.


UTF-32 Codecs
"""""""""""""

These are the UTF-32 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-``NULL``) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first four bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output.

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   If *byteorder* is ``NULL``, the codec starts in native order mode.

   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF32`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python byte string using the UTF-32 encoding in native byte
   order. The string always starts with a BOM mark. Error handling is "strict".
   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, \
                              const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*.
   Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If ``Py_UNICODE_WIDE`` is not defined, surrogate pairs will be output
   as a single code point.

   Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUTF32String` or :c:func:`PyUnicode_AsEncodedString`.


UTF-16 Codecs
"""""""""""""

These are the UTF-16 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-``NULL``) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first two bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output (where it will result in
   either a ``\ufeff`` or a ``\ufffe`` character).

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   If *byteorder* is ``NULL``, the codec starts in native order mode.

   Return ``NULL`` if an exception was raised by the codec.
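
The byte order mark rule described for :c:func:`PyUnicode_DecodeUTF16` can be
sketched in standalone C. This is only an illustration of the documented rule,
not CPython's implementation: the helper name ``utf16_consume_bom`` and its
return convention are assumptions made for this example and are not part of
the C API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the Python C API).
 *
 * Applies the documented BOM rule to the start of a UTF-16 buffer:
 *   *byteorder == -1  little endian, a BOM is treated as ordinary data
 *   *byteorder ==  0  native order, a leading BOM selects the byte order
 *   *byteorder ==  1  big endian, a BOM is treated as ordinary data
 *
 * Returns the number of leading bytes the decoder should skip (0 or 2)
 * and updates *byteorder when a BOM was recognised. */
static size_t
utf16_consume_bom(const unsigned char *buf, size_t len, int *byteorder)
{
    assert(buf != NULL && byteorder != NULL);

    /* With an explicit byte order, or fewer than two bytes, skip nothing. */
    if (*byteorder != 0 || len < 2) {
        return 0;
    }
    if (buf[0] == 0xFF && buf[1] == 0xFE) {   /* U+FEFF stored little endian */
        *byteorder = -1;
        return 2;
    }
    if (buf[0] == 0xFE && buf[1] == 0xFF) {   /* U+FEFF stored big endian */
        *byteorder = 1;
        return 2;
    }
    return 0;   /* no BOM: the caller keeps decoding in native order */
}
```

A caller would skip the returned number of bytes and decode the rest with the
byte order left in ``*byteorder``. Passing ``-1`` or ``1`` leaves the buffer
untouched, which matches the statement that an explicit byte order copies any
BOM to the output.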


.. c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF16`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python byte string using the UTF-16 encoding in native byte
   order. The string always starts with a BOM mark. Error handling is "strict".
   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, \
                              const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-16 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If ``Py_UNICODE_WIDE`` is defined, a single :c:type:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :c:type:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUTF16String` or :c:func:`PyUnicode_AsEncodedString`.
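
The remark that a single value "may get represented as a surrogate pair" refers
to the standard UTF-16 encoding of code points above U+FFFF. The following
standalone sketch shows that arithmetic. The helper ``utf16_units`` is
hypothetical and not part of the C API, and returning ``0`` for values that
cannot be encoded is a choice made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the Python C API).
 *
 * Splits one Unicode code point into UTF-16 code units.
 * Returns the number of units written to out (1 or 2), or 0 when cp is
 * a lone surrogate or lies above U+10FFFF. */
static size_t
utf16_units(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;                       /* not encodable in this sketch */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP code point: one unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* low surrogate */
    return 2;
}
```

For instance, U+1F600 yields the units ``0xD83D`` and ``0xDE00``, while U+0041
stays a single unit. The byte order of each unit in the output is a separate
concern, governed by the *byteorder* argument.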


UTF-7 Codecs
""""""""""""

These are the UTF-7 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
   *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF7`. If
   *consumed* is not ``NULL``, trailing incomplete UTF-7 base-64 sections will not
   be treated as an error. Those bytes will not be decoded and the number of
   bytes that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, \
                              int base64SetO, int base64WhiteSpace, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given size using UTF-7 and
   return a Python bytes object. Return ``NULL`` if an exception was raised by
   the codec.

   If *base64SetO* is nonzero, "Set O" (punctuation that has no otherwise
   special meaning) will be encoded in base-64. If *base64WhiteSpace* is
   nonzero, whitespace will be encoded in base-64. Both are set to zero for the
   Python "utf-7" codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsEncodedString`.


Unicode-Escape Codecs
"""""""""""""""""""""

These are the "Unicode Escape" codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, \
                              Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as a
   bytes object. Error handling is "strict". Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Unicode-Escape and
   return a bytes object. Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUnicodeEscapeString`.


Raw-Unicode-Escape Codecs
"""""""""""""""""""""""""

These are the "Raw Unicode Escape" codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, \
                              Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as
   a bytes object. Error handling is "strict". Return ``NULL`` if an exception
   was raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, \
                              Py_ssize_t size)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Raw-Unicode-Escape
   and return a bytes object. Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsRawUnicodeEscapeString` or
      :c:func:`PyUnicode_AsEncodedString`.


Latin-1 Codecs
""""""""""""""

These are the Latin-1 codec APIs: Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.


.. c:function:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python bytes
   object. Error handling is "strict". Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Latin-1 and
   return a Python bytes object. Return ``NULL`` if an exception was raised by
   the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsLatin1String` or
      :c:func:`PyUnicode_AsEncodedString`.


ASCII Codecs
""""""""""""

These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
codes generate errors.


.. c:function:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python bytes
   object. Error handling is "strict". Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using ASCII and
   return a Python bytes object. Return ``NULL`` if an exception was raised by
   the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsASCIIString` or
      :c:func:`PyUnicode_AsEncodedString`.


Character Map Codecs
""""""""""""""""""""

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters. The mapping objects provided must support the
:meth:`__getitem__` mapping interface; dictionaries and sequences work well.

These are the mapping codec APIs:

.. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *data, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *data*
   using the given *mapping* object. Return ``NULL`` if an exception was raised
   by the codec.

   If *mapping* is ``NULL``, Latin-1 decoding will be applied. Otherwise,
   *mapping* must map byte ordinals (integers in the range from 0 to 255)
   to Unicode strings, integers (which are then interpreted as Unicode
   ordinals) or ``None``. Unmapped data bytes (ones which cause a
   :exc:`LookupError`, as well as ones which get mapped to ``None``,
   ``0xFFFE`` or ``'\ufffe'``) are treated as undefined mappings and cause
   an error.


.. c:function:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the
   result as a bytes object.
   Error handling is "strict". Return ``NULL`` if an
   exception was raised by the codec.

   The *mapping* object must map Unicode ordinal integers to bytes objects,
   integers in the range from 0 to 255 or ``None``. Unmapped character
   ordinals (ones which cause a :exc:`LookupError`) as well as those mapped to
   ``None`` are treated as "undefined mapping" and cause an error.


.. c:function:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using the given
   *mapping* object and return the result as a bytes object. Return ``NULL`` if
   an exception was raised by the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsCharmapString` or
      :c:func:`PyUnicode_AsEncodedString`.


The following codec API is special in that it maps Unicode to Unicode.

.. c:function:: PyObject* PyUnicode_Translate(PyObject *unicode, \
                              PyObject *mapping, const char *errors)

   Translate a Unicode object using the given *mapping* object and return the
   resulting Unicode object. Return ``NULL`` if an exception was raised by the
   codec.

   The *mapping* object must map Unicode ordinal integers to Unicode strings,
   integers (which are then interpreted as Unicode ordinals) or ``None``
   (causing deletion of the character). Unmapped character ordinals (ones
   which cause a :exc:`LookupError`) are left untouched and are copied as-is.


.. c:function:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Translate a :c:type:`Py_UNICODE` buffer of the given *size* by applying a
   character *mapping* table to it and return the resulting Unicode object.
1498 Return ``NULL`` when an exception was raised by the codec. 1499 1500 .. deprecated-removed:: 3.3 4.0 1501 Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using 1502 :c:func:`PyUnicode_Translate`. or :ref:`generic codec based API 1503 <codec-registry>` 1504 1505 1506MBCS codecs for Windows 1507""""""""""""""""""""""" 1508 1509These are the MBCS codec APIs. They are currently only available on Windows and 1510use the Win32 MBCS converters to implement the conversions. Note that MBCS (or 1511DBCS) is a class of encodings, not just one. The target encoding is defined by 1512the user settings on the machine running the codec. 1513 1514.. c:function:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors) 1515 1516 Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*. 1517 Return ``NULL`` if an exception was raised by the codec. 1518 1519 1520.. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, Py_ssize_t size, \ 1521 const char *errors, Py_ssize_t *consumed) 1522 1523 If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeMBCS`. If 1524 *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeMBCSStateful` will not decode 1525 trailing lead byte and the number of bytes that have been decoded will be stored 1526 in *consumed*. 1527 1528 1529.. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode) 1530 1531 Encode a Unicode object using MBCS and return the result as Python bytes 1532 object. Error handling is "strict". Return ``NULL`` if an exception was 1533 raised by the codec. 1534 1535 1536.. c:function:: PyObject* PyUnicode_EncodeCodePage(int code_page, PyObject *unicode, const char *errors) 1537 1538 Encode the Unicode object using the specified code page and return a Python 1539 bytes object. Return ``NULL`` if an exception was raised by the codec. Use 1540 :c:data:`CP_ACP` code page to get the MBCS encoder. 1541 1542 .. versionadded:: 3.3 1543 1544 1545.. 
c:function:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors) 1546 1547 Encode the :c:type:`Py_UNICODE` buffer of the given *size* using MBCS and return 1548 a Python bytes object. Return ``NULL`` if an exception was raised by the 1549 codec. 1550 1551 .. deprecated-removed:: 3.3 4.0 1552 Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using 1553 :c:func:`PyUnicode_AsMBCSString`, :c:func:`PyUnicode_EncodeCodePage` or 1554 :c:func:`PyUnicode_AsEncodedString`. 1555 1556 1557Methods & Slots 1558""""""""""""""" 1559 1560 1561.. _unicodemethodsandslots: 1562 1563Methods and Slot Functions 1564^^^^^^^^^^^^^^^^^^^^^^^^^^ 1565 1566The following APIs are capable of handling Unicode objects and strings on input 1567(we refer to them as strings in the descriptions) and return Unicode objects or 1568integers as appropriate. 1569 1570They all return ``NULL`` or ``-1`` if an exception occurs. 1571 1572 1573.. c:function:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right) 1574 1575 Concat two strings giving a new Unicode string. 1576 1577 1578.. c:function:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit) 1579 1580 Split a string giving a list of Unicode strings. If *sep* is ``NULL``, splitting 1581 will be done at all whitespace substrings. Otherwise, splits occur at the given 1582 separator. At most *maxsplit* splits will be done. If negative, no limit is 1583 set. Separators are not included in the resulting list. 1584 1585 1586.. c:function:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend) 1587 1588 Split a Unicode string at line breaks, returning a list of Unicode strings. 1589 CRLF is considered to be one line break. If *keepend* is ``0``, the Line break 1590 characters are not included in the resulting strings. 1591 1592 1593.. 
c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, \ 1594 const char *errors) 1595 1596 Translate a string by applying a character mapping table to it and return the 1597 resulting Unicode object. 1598 1599 The mapping table must map Unicode ordinal integers to Unicode ordinal integers 1600 or ``None`` (causing deletion of the character). 1601 1602 Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries 1603 and sequences work well. Unmapped character ordinals (ones which cause a 1604 :exc:`LookupError`) are left untouched and are copied as-is. 1605 1606 *errors* has the usual meaning for codecs. It may be ``NULL`` which indicates to 1607 use the default error handling. 1608 1609 1610.. c:function:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq) 1611 1612 Join a sequence of strings using the given *separator* and return the resulting 1613 Unicode string. 1614 1615 1616.. c:function:: Py_ssize_t PyUnicode_Tailmatch(PyObject *str, PyObject *substr, \ 1617 Py_ssize_t start, Py_ssize_t end, int direction) 1618 1619 Return ``1`` if *substr* matches ``str[start:end]`` at the given tail end 1620 (*direction* == ``-1`` means to do a prefix match, *direction* == ``1`` a suffix match), 1621 ``0`` otherwise. Return ``-1`` if an error occurred. 1622 1623 1624.. c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, \ 1625 Py_ssize_t start, Py_ssize_t end, int direction) 1626 1627 Return the first position of *substr* in ``str[start:end]`` using the given 1628 *direction* (*direction* == ``1`` means to do a forward search, *direction* == ``-1`` a 1629 backward search). The return value is the index of the first match; a value of 1630 ``-1`` indicates that no match was found, and ``-2`` indicates that an error 1631 occurred and an exception has been set. 1632 1633 1634.. 
c:function:: Py_ssize_t PyUnicode_FindChar(PyObject *str, Py_UCS4 ch, \
                               Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of the character *ch* in ``str[start:end]`` using
   the given *direction* (*direction* == ``1`` means to do a forward search,
   *direction* == ``-1`` a backward search). The return value is the index of the
   first match; a value of ``-1`` indicates that no match was found, and ``-2``
   indicates that an error occurred and an exception has been set.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      *start* and *end* are now adjusted to behave like ``str[start:end]``.


.. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, \
                               Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``. Return ``-1`` if an error occurred.


.. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, \
                              PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == ``-1`` means replace all
   occurrences.


.. c:function:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return ``-1``, ``0``, ``1`` for less than, equal, and greater than,
   respectively.

   This function returns ``-1`` upon failure, so one should call
   :c:func:`PyErr_Occurred` to check for errors.


.. c:function:: int PyUnicode_CompareWithASCIIString(PyObject *uni, const char *string)

   Compare a Unicode object, *uni*, with *string* and return ``-1``, ``0``, ``1`` for less
   than, equal, and greater than, respectively. It is best to pass only
   ASCII-encoded strings, but the function interprets the input string as
   ISO-8859-1 if it contains non-ASCII characters.

   This function does not raise exceptions.


.. c:function:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. c:function:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``.


.. c:function:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one-element Unicode string. ``-1`` is returned
   if there was an error.


.. c:function:: void PyUnicode_InternInPlace(PyObject **string)

   Intern the argument *\*string* in place. The argument must be the address of a
   pointer variable pointing to a Python Unicode string object. If there is an
   existing interned string that is the same as *\*string*, this function sets
   *\*string* to it (decrementing the reference count of the old string object and
   incrementing the reference count of the interned string object); otherwise it
   leaves *\*string* alone and interns it (incrementing its reference count).
   (Clarification: even though there is a lot of talk about reference counts, think
   of this function as reference-count-neutral; you own the object after the call
   if and only if you owned it before the call.)


.. 
c:function:: PyObject* PyUnicode_InternFromString(const char *v)

   A combination of :c:func:`PyUnicode_FromString` and
   :c:func:`PyUnicode_InternInPlace`, returning either a new Unicode string
   object that has been interned, or a new ("owned") reference to an earlier
   interned string object with the same value.

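
As a minimal sketch of how the functions in this section might be combined, the
helpers below split and re-join a string, test for a prefix, search forward for
a substring, and check that interning the same C string twice yields the same
object. The helper names (``join_words_with_dash``, ``has_prefix``,
``find_forward``, ``interned_twice_is_same_object``) are hypothetical and exist
only for illustration; the code assumes that the interpreter has already been
initialized and that every argument is a valid Unicode object.

.. code-block:: c

   #define PY_SSIZE_T_CLEAN
   #include <Python.h>

   /* Hypothetical helper: collapse every run of whitespace in *text* into a
      single '-'.  Returns a new reference, or NULL with an exception set. */
   static PyObject *
   join_words_with_dash(PyObject *text)
   {
       PyObject *words, *sep, *result;

       /* sep == NULL splits at whitespace; a negative maxsplit sets no limit. */
       words = PyUnicode_Split(text, NULL, -1);
       if (words == NULL)
           return NULL;

       sep = PyUnicode_FromString("-");
       if (sep == NULL) {
           Py_DECREF(words);
           return NULL;
       }

       result = PyUnicode_Join(sep, words);
       Py_DECREF(sep);
       Py_DECREF(words);
       return result;
   }

   /* Hypothetical helper: 1 if *text* starts with *prefix*, 0 if it does not,
      -1 on error.  direction == -1 requests a prefix match. */
   static int
   has_prefix(PyObject *text, PyObject *prefix)
   {
       return (int)PyUnicode_Tailmatch(text, prefix, 0, PY_SSIZE_T_MAX, -1);
   }

   /* Hypothetical helper: forward search over the whole string.  Returns the
      index of the first match, -1 if there is none, and -2 on error. */
   static Py_ssize_t
   find_forward(PyObject *text, PyObject *needle)
   {
       return PyUnicode_Find(text, needle, 0, PY_SSIZE_T_MAX, 1);
   }

   /* Hypothetical helper: 1 if interning the same C string twice gives the
      same object, 0 if not, -1 on error.  Both references are owned here and
      are released before returning. */
   static int
   interned_twice_is_same_object(const char *s)
   {
       PyObject *a = PyUnicode_InternFromString(s);
       PyObject *b = PyUnicode_InternFromString(s);
       int same = -1;

       if (a != NULL && b != NULL)
           same = (a == b);
       Py_XDECREF(a);
       Py_XDECREF(b);
       return same;
   }

Note that ``find_forward`` passes the ``-2`` error value of
:c:func:`PyUnicode_Find` through unchanged, so a caller has to distinguish it
from the ``-1`` "not found" result.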