.. highlight:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Georg Brandl <georg@python.org>

Unicode Objects
^^^^^^^^^^^^^^^

Since the implementation of :pep:`393` in Python 3.3, Unicode objects internally
use a variety of representations, in order to allow handling the complete range
of Unicode characters while staying memory efficient. There are special cases
for strings where all code points are below 128, 256, or 65536; otherwise, code
points must be below 1114112 (which is the full Unicode range).

:c:type:`Py_UNICODE*` and UTF-8 representations are created on demand and cached
in the Unicode object. The :c:type:`Py_UNICODE*` representation is deprecated
and inefficient.

Due to the transition between the old APIs and the new APIs, Unicode objects
can internally be in two states depending on how they were created:

* "canonical" Unicode objects are all objects created by a non-deprecated
  Unicode API. They use the most efficient representation allowed by the
  implementation.

* "legacy" Unicode objects have been created through one of the deprecated
  APIs (typically :c:func:`PyUnicode_FromUnicode`) and only bear the
  :c:type:`Py_UNICODE*` representation; you will have to call
  :c:func:`PyUnicode_READY` on them before calling any other API.

.. note::
   The "legacy" Unicode object will be removed in Python 3.12 together with
   the deprecated APIs. All Unicode objects will be "canonical" from then on.
   See :pep:`623` for more information.


Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:

.. c:type:: Py_UCS4
            Py_UCS2
            Py_UCS1

   These types are typedefs for unsigned integer types wide enough to contain
   characters of 32 bits, 16 bits and 8 bits, respectively.
   When dealing with
   single Unicode characters, use :c:type:`Py_UCS4`.

   .. versionadded:: 3.3


.. c:type:: Py_UNICODE

   This is a typedef of :c:type:`wchar_t`, which is a 16-bit type or 32-bit type
   depending on the platform.

   .. versionchanged:: 3.3
      In previous versions, this was a 16-bit type or a 32-bit type depending on
      whether you selected a "narrow" or "wide" Unicode version of Python at
      build time.


.. c:type:: PyASCIIObject
            PyCompactUnicodeObject
            PyUnicodeObject

   These subtypes of :c:type:`PyObject` represent a Python Unicode object. In
   almost all cases, they shouldn't be used directly, since all API functions
   that deal with Unicode objects take and return :c:type:`PyObject` pointers.

   .. versionadded:: 3.3


.. c:var:: PyTypeObject PyUnicode_Type

   This instance of :c:type:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``str``.


The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:

.. c:function:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype. This function always succeeds.


.. c:function:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype. This function always succeeds.


.. c:function:: int PyUnicode_READY(PyObject *o)

   Ensure the string object *o* is in the "canonical" representation. This is
   required before using any of the access macros described below.

   .. XXX expand on when it is not required

   Returns ``0`` on success and ``-1`` with an exception set on failure, which in
   particular happens if memory allocation fails.

   .. versionadded:: 3.3

   .. deprecated-removed:: 3.10 3.12
      This API will be removed with :c:func:`PyUnicode_FromUnicode`.


.. c:function:: Py_ssize_t PyUnicode_GET_LENGTH(PyObject *o)

   Return the length of the Unicode string, in code points. *o* has to be a
   Unicode object in the "canonical" representation (not checked).

   .. versionadded:: 3.3


.. c:function:: Py_UCS1* PyUnicode_1BYTE_DATA(PyObject *o)
                Py_UCS2* PyUnicode_2BYTE_DATA(PyObject *o)
                Py_UCS4* PyUnicode_4BYTE_DATA(PyObject *o)

   Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4
   integer types for direct character access. No check is performed that the
   canonical representation has the correct character size; use
   :c:func:`PyUnicode_KIND` to select the right macro. Make sure
   :c:func:`PyUnicode_READY` has been called before accessing this.

   .. versionadded:: 3.3


.. c:macro:: PyUnicode_WCHAR_KIND
             PyUnicode_1BYTE_KIND
             PyUnicode_2BYTE_KIND
             PyUnicode_4BYTE_KIND

   Return values of the :c:func:`PyUnicode_KIND` macro.

   .. versionadded:: 3.3

   .. deprecated-removed:: 3.10 3.12
      ``PyUnicode_WCHAR_KIND`` is deprecated.


.. c:function:: unsigned int PyUnicode_KIND(PyObject *o)

   Return one of the PyUnicode kind constants (see above) that indicate how many
   bytes per character this Unicode object uses to store its data. *o* has to
   be a Unicode object in the "canonical" representation (not checked).

   .. XXX document "0" return value?

   .. versionadded:: 3.3


.. c:function:: void* PyUnicode_DATA(PyObject *o)

   Return a void pointer to the raw Unicode buffer. *o* has to be a Unicode
   object in the "canonical" representation (not checked).

   .. versionadded:: 3.3


.. c:function:: void PyUnicode_WRITE(int kind, void *data, Py_ssize_t index, \
                                     Py_UCS4 value)

   Write into a canonical representation *data* (as obtained with
   :c:func:`PyUnicode_DATA`). This macro does not do any sanity checks and is
   intended for usage in loops. The caller should cache the *kind* value and
   *data* pointer as obtained from other macro calls. *index* is the index in
   the string (starts at 0) and *value* is the new code point value which should
   be written to that location.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_READ(int kind, void *data, Py_ssize_t index)

   Read a code point from a canonical representation *data* (as obtained with
   :c:func:`PyUnicode_DATA`). No checks or ready calls are performed.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_READ_CHAR(PyObject *o, Py_ssize_t index)

   Read a character from a Unicode object *o*, which must be in the "canonical"
   representation. This is less efficient than :c:func:`PyUnicode_READ` if you
   do multiple consecutive reads.

   .. versionadded:: 3.3


.. c:macro:: PyUnicode_MAX_CHAR_VALUE(o)

   Return the maximum code point that is suitable for creating another string
   based on *o*, which must be in the "canonical" representation. This is
   always an approximation but more efficient than iterating over the string.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
   code units (this includes surrogate pairs as 2 units). *o* has to be a
   Unicode object (not checked).

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation in
   bytes. *o* has to be a Unicode object (not checked).

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
                const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to a :c:type:`Py_UNICODE` representation of the object. The
   returned buffer is always terminated with an extra null code point. It
   may also contain embedded null code points, which would cause the string
   to be truncated when used in most C functions. The ``AS_DATA`` form
   casts the pointer to :c:type:`const char *`. The *o* argument has to be
   a Unicode object (not checked).

   .. versionchanged:: 3.3
      This macro is now inefficient -- because in many cases the
      :c:type:`Py_UNICODE` representation does not exist and needs to be created
      -- and can fail (return ``NULL`` with an exception set). Try to port the
      code to use the new :c:func:`PyUnicode_nBYTE_DATA` macros or use
      :c:func:`PyUnicode_WRITE` or :c:func:`PyUnicode_READ`.

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using the
      :c:func:`PyUnicode_nBYTE_DATA` family of macros.


.. c:function:: int PyUnicode_IsIdentifier(PyObject *o)

   Return ``1`` if the string is a valid identifier according to the language
   definition, section :ref:`identifiers`. Return ``0`` otherwise.

   .. versionchanged:: 3.9
      The function does not call :c:func:`Py_FatalError` anymore if the string
      is not ready.


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties.
The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. c:function:: int Py_UNICODE_ISSPACE(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a whitespace character.


.. c:function:: int Py_UNICODE_ISLOWER(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a lowercase character.


.. c:function:: int Py_UNICODE_ISUPPER(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is an uppercase character.


.. c:function:: int Py_UNICODE_ISTITLE(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a titlecase character.


.. c:function:: int Py_UNICODE_ISLINEBREAK(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a linebreak character.


.. c:function:: int Py_UNICODE_ISDECIMAL(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a decimal character.


.. c:function:: int Py_UNICODE_ISDIGIT(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a digit character.


.. c:function:: int Py_UNICODE_ISNUMERIC(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a numeric character.


.. c:function:: int Py_UNICODE_ISALPHA(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphabetic character.


.. c:function:: int Py_UNICODE_ISALNUM(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphanumeric character.


.. c:function:: int Py_UNICODE_ISPRINTABLE(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a printable character.
   Nonprintable characters are those characters defined in the Unicode character
   database as "Other" or "Separator", excepting the ASCII space (0x20) which is
   considered printable.
   (Note that printable characters in this context are
   those which should not be escaped when :func:`repr` is invoked on a string.
   It has no bearing on the handling of strings written to :data:`sys.stdout` or
   :data:`sys.stderr`.)


These APIs can be used for fast direct character conversions:


.. c:function:: Py_UCS4 Py_UNICODE_TOLOWER(Py_UCS4 ch)

   Return the character *ch* converted to lower case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: Py_UCS4 Py_UNICODE_TOUPPER(Py_UCS4 ch)

   Return the character *ch* converted to upper case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: Py_UCS4 Py_UNICODE_TOTITLE(Py_UCS4 ch)

   Return the character *ch* converted to title case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: int Py_UNICODE_TODECIMAL(Py_UCS4 ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. c:function:: int Py_UNICODE_TODIGIT(Py_UCS4 ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. c:function:: double Py_UNICODE_TONUMERIC(Py_UCS4 ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.


These APIs can be used to work with surrogates:

.. c:macro:: Py_UNICODE_IS_SURROGATE(ch)

   Check if *ch* is a surrogate (``0xD800 <= ch <= 0xDFFF``).

.. c:macro:: Py_UNICODE_IS_HIGH_SURROGATE(ch)

   Check if *ch* is a high surrogate (``0xD800 <= ch <= 0xDBFF``).

.. c:macro:: Py_UNICODE_IS_LOW_SURROGATE(ch)

   Check if *ch* is a low surrogate (``0xDC00 <= ch <= 0xDFFF``).

.. c:macro:: Py_UNICODE_JOIN_SURROGATES(high, low)

   Join two surrogate characters and return a single Py_UCS4 value.
   *high* and *low* are respectively the leading and trailing surrogates in a
   surrogate pair.


Creating and accessing Unicode strings
""""""""""""""""""""""""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:

.. c:function:: PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)

   Create a new Unicode object. *maxchar* should be the true maximum code point
   to be placed in the string. As an approximation, it can be rounded up to the
   nearest value in the sequence 127, 255, 65535, 1114111.

   This is the recommended way to allocate a new Unicode object. Objects
   created using this function are not resizable.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_FromKindAndData(int kind, const void *buffer, \
                                                    Py_ssize_t size)

   Create a new Unicode object with the given *kind* (possible values are
   :c:macro:`PyUnicode_1BYTE_KIND` etc., as returned by
   :c:func:`PyUnicode_KIND`). The *buffer* must point to an array of *size*
   units of 1, 2 or 4 bytes per character, as given by the kind.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*. The bytes will be
   interpreted as being UTF-8 encoded. The buffer is copied into the new
   object. If the buffer is not ``NULL``, the return value might be a shared
   object, i.e. modification of the data is not allowed.

   If *u* is ``NULL``, this function behaves like :c:func:`PyUnicode_FromUnicode`
   with the buffer set to ``NULL``. This usage is deprecated in favor of
   :c:func:`PyUnicode_New`, and will be removed in Python 3.12.


.. c:function:: PyObject *PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.


.. c:function:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :c:func:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python Unicode string and return
   a string with the values formatted into it. The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   ASCII-encoded string. The following format characters are allowed:

   .. % This should be exactly the same as the table in PyErr_Format.
   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.
   .. % Similar comments apply to the %ll width modifier and

   .. tabularcolumns:: |l|l|L|

   +-------------------+---------------------+----------------------------------+
   | Format Characters | Type                | Comment                          |
   +===================+=====================+==================================+
   | :attr:`%%`        | *n/a*               | The literal % character.         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%c`        | int                 | A single character,              |
   |                   |                     | represented as a C int.          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%d`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%d")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%u`        | unsigned int        | Equivalent to                    |
   |                   |                     | ``printf("%u")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%ld`       | long                | Equivalent to                    |
   |                   |                     | ``printf("%ld")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%li`       | long                | Equivalent to                    |
   |                   |                     | ``printf("%li")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lu`       | unsigned long       | Equivalent to                    |
   |                   |                     | ``printf("%lu")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lld`      | long long           | Equivalent to                    |
   |                   |                     | ``printf("%lld")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lli`      | long long           | Equivalent to                    |
   |                   |                     | ``printf("%lli")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%llu`      | unsigned long long  | Equivalent to                    |
   |                   |                     | ``printf("%llu")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Equivalent to                    |
   |                   |                     | ``printf("%zd")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zi`       | Py_ssize_t          | Equivalent to                    |
   |                   |                     | ``printf("%zi")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zu`       | size_t              | Equivalent to                    |
   |                   |                     | ``printf("%zu")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%i`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%i")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%x`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%x")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%s`        | const char\*        | A null-terminated C character    |
   |                   |                     | array.                           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%p`        | const void\*        | The hex representation of a C    |
   |                   |                     | pointer. Mostly equivalent to    |
   |                   |                     | ``printf("%p")`` except that     |
   |                   |                     | it is guaranteed to start with   |
   |                   |                     | the literal ``0x`` regardless    |
   |                   |                     | of what the platform's           |
   |                   |                     | ``printf`` yields.               |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%A`        | PyObject\*          | The result of calling            |
   |                   |                     | :func:`ascii`.                   |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%U`        | PyObject\*          | A Unicode object.                |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%V`        | PyObject\*,         | A Unicode object (which may be   |
   |                   | const char\*        | ``NULL``) and a null-terminated  |
   |                   |                     | C character array as a second    |
   |                   |                     | parameter (which will be used,   |
   |                   |                     | if the first parameter is        |
   |                   |                     | ``NULL``).                       |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling            |
   |                   |                     | :c:func:`PyObject_Str`.          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling            |
   |                   |                     | :c:func:`PyObject_Repr`.         |
   +-------------------+---------------------+----------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.

   .. note::
      The width formatter unit is the number of characters rather than bytes.
      The precision formatter unit is the number of bytes for ``"%s"`` and
      ``"%V"`` (if the ``PyObject*`` argument is ``NULL``), and the number of
      characters for ``"%A"``, ``"%U"``, ``"%S"``, ``"%R"`` and ``"%V"``
      (if the ``PyObject*`` argument is not ``NULL``).
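
   The format characters and the precision rule can be illustrated with a
   minimal sketch. The helper name ``format_item_summary`` and its parameters
   are hypothetical (they are not part of the C API), and the sketch assumes
   that the interpreter is already initialized and that Python 3.4 or later is
   in use, since it relies on a precision for ``"%s"``:

   ```c
   #include <Python.h>
   #include <stdio.h>

   /* Hypothetical helper (not part of the C API): formats a short summary
    * and copies it, UTF-8 encoded, into the caller's buffer "out".
    * Returns 0 on success, or -1 with a Python exception set on failure.
    * Assumes the interpreter is already initialized. */
   int
   format_item_summary(const char *name, Py_ssize_t count,
                       char *out, size_t outlen)
   {
       /* "%.5s" copies at most 5 bytes of name; "%zd" takes a Py_ssize_t. */
       PyObject *text = PyUnicode_FromFormat("%.5s: %zd item(s)", name, count);
       if (text == NULL) {
           return -1;
       }

       /* The UTF-8 buffer is owned by text, so copy it before the decref. */
       const char *utf8 = PyUnicode_AsUTF8(text);
       if (utf8 == NULL) {
           Py_DECREF(text);
           return -1;
       }
       snprintf(out, outlen, "%s", utf8);
       Py_DECREF(text);
       return 0;
   }
   ```

   With these assumptions, calling
   ``format_item_summary("queue-primary", 3, buf, sizeof buf)`` should leave
   ``queue: 3 item(s)`` in ``buf``, because the precision limits ``"%s"`` to
   the first five bytes of the name.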

   .. [1] For integer specifiers (d, u, ld, li, lu, lld, lli, llu, zd, zi,
      zu, i, x): the 0-conversion flag has effect even when a precision is given.

   .. versionchanged:: 3.2
      Support for ``"%lld"`` and ``"%llu"`` added.

   .. versionchanged:: 3.3
      Support for ``"%li"``, ``"%lli"`` and ``"%zi"`` added.

   .. versionchanged:: 3.4
      Support width and precision formatter for ``"%s"``, ``"%A"``, ``"%U"``,
      ``"%V"``, ``"%S"``, ``"%R"`` added.


.. c:function:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :c:func:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.


.. c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, \
                               const char *encoding, const char *errors)

   Decode an encoded object *obj* to a Unicode object.

   :class:`bytes`, :class:`bytearray` and other
   :term:`bytes-like objects <bytes-like object>`
   are decoded according to the given *encoding* and using the error handling
   defined by *errors*. Both can be ``NULL`` to have the interface use the default
   values (see :ref:`builtincodecs` for details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns ``NULL`` if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. c:function:: Py_ssize_t PyUnicode_GetLength(PyObject *unicode)

   Return the length of the Unicode object, in code points.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_CopyCharacters(PyObject *to, \
                                                    Py_ssize_t to_start, \
                                                    PyObject *from, \
                                                    Py_ssize_t from_start, \
                                                    Py_ssize_t how_many)

   Copy characters from one Unicode object into another. This function performs
   character conversion when necessary and falls back to :c:func:`memcpy` if
   possible.
   Returns ``-1`` and sets an exception on error, otherwise returns
   the number of copied characters.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_Fill(PyObject *unicode, Py_ssize_t start, \
                        Py_ssize_t length, Py_UCS4 fill_char)

   Fill a string with a character: write *fill_char* into
   ``unicode[start:start+length]``.

   Fail if *fill_char* is bigger than the string maximum character, or if the
   string has more than 1 reference.

   Return the number of written characters, or return ``-1`` and raise an
   exception on error.

   .. versionadded:: 3.3


.. c:function:: int PyUnicode_WriteChar(PyObject *unicode, Py_ssize_t index, \
                                        Py_UCS4 character)

   Write a character to a string. The string must have been created through
   :c:func:`PyUnicode_New`. Since Unicode strings are supposed to be immutable,
   the string must not be shared, or have been hashed yet.

   This function checks that *unicode* is a Unicode object, that the index is
   not out of bounds, and that the object can be modified safely (i.e. that
   its reference count is one).

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_ReadChar(PyObject *unicode, Py_ssize_t index)

   Read a character from a string. This function checks that *unicode* is a
   Unicode object and the index is not out of bounds, in contrast to the macro
   version :c:func:`PyUnicode_READ_CHAR`.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_Substring(PyObject *str, Py_ssize_t start, \
                                              Py_ssize_t end)

   Return a substring of *str*, from character index *start* (included) to
   character index *end* (excluded). Negative indices are not supported.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4* PyUnicode_AsUCS4(PyObject *u, Py_UCS4 *buffer, \
                                          Py_ssize_t buflen, int copy_null)

   Copy the string *u* into a UCS4 buffer, including a null character, if
   *copy_null* is set. Returns ``NULL`` and sets an exception on error (in
   particular, a :exc:`SystemError` if *buflen* is smaller than the length of
   *u*). *buffer* is returned on success.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4* PyUnicode_AsUCS4Copy(PyObject *u)

   Copy the string *u* into a new UCS4 buffer that is allocated using
   :c:func:`PyMem_Malloc`. If this fails, ``NULL`` is returned with a
   :exc:`MemoryError` set. The returned buffer always has an extra
   null code point appended.

   .. versionadded:: 3.3


Deprecated Py_UNICODE APIs
""""""""""""""""""""""""""

.. deprecated-removed:: 3.3 3.12

These API functions are deprecated with the implementation of :pep:`393`.
Extension modules can continue using them until they are removed, but need to
be aware that their use can now cause performance and memory hits.


.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
   may be ``NULL`` which causes the contents to be undefined. It is the user's
   responsibility to fill in the needed data. The buffer is copied into the new
   object.

   If the buffer is not ``NULL``, the return value might be a shared object.
   Therefore, modification of the resulting Unicode object is only allowed when
   *u* is ``NULL``.

   If the buffer is ``NULL``, :c:func:`PyUnicode_READY` must be called once the
   string content has been filled before using any of the access macros such as
   :c:func:`PyUnicode_KIND`.

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_FromKindAndData`, :c:func:`PyUnicode_FromWideChar`, or
      :c:func:`PyUnicode_New`.


.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal
   :c:type:`Py_UNICODE` buffer, or ``NULL`` on error. This will create the
   :c:type:`Py_UNICODE*` representation of the object if it is not yet
   available. The buffer is always terminated with an extra null code point.
   Note that the resulting :c:type:`Py_UNICODE` string may also contain
   embedded null code points, which would cause the string to be truncated when
   used in most C functions.

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_AsUCS4`, :c:func:`PyUnicode_AsWideChar`,
      :c:func:`PyUnicode_ReadChar` or similar new APIs.


.. c:function:: PyObject* PyUnicode_TransformDecimalToASCII(Py_UNICODE *s, Py_ssize_t size)

   Create a Unicode object by replacing all decimal digits in the
   :c:type:`Py_UNICODE` buffer *s* of the given *size* by ASCII digits 0--9
   according to their decimal value. Return ``NULL`` if an exception occurs.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`Py_UNICODE_TODECIMAL`.


.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size)

   Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:type:`Py_UNICODE`
   array length (excluding the extra null terminator) in *size*.
   Note that the resulting :c:type:`Py_UNICODE*` string
   may contain embedded null code points, which would cause the string to be
   truncated when used in most C functions.

   .. versionadded:: 3.3

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_AsUCS4`, :c:func:`PyUnicode_AsWideChar`,
      :c:func:`PyUnicode_ReadChar` or similar new APIs.


.. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
   code units (this includes surrogate pairs as 2 units).

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Copy an instance of a Unicode subtype to a new true Unicode object if
   necessary. If *obj* is already a true Unicode object (not a subtype),
   return the reference with incremented refcount.

   Objects other than Unicode or its subtypes will cause a :exc:`TypeError`.


Locale Encoding
"""""""""""""""

The current locale encoding can be used to decode text from the operating
system.

.. c:function:: PyObject* PyUnicode_DecodeLocaleAndSize(const char *str, \
                                                        Py_ssize_t len, \
                                                        const char *errors)

   Decode a string from UTF-8 on Android and VxWorks, or from the current
   locale encoding on other platforms. The supported
   error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The decoder uses the ``"strict"`` error handler if
   *errors* is ``NULL``. *str* must end with a null character but
   cannot contain embedded null characters.

   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` to decode a string from
   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
   Python startup).

   This function ignores the :ref:`Python UTF-8 Mode <utf8-mode>`.

   .. seealso::

      The :c:func:`Py_DecodeLocale` function.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously, :c:func:`Py_DecodeLocale`
      was used for the ``surrogateescape``, and the current locale encoding was
      used for ``strict``.


.. c:function:: PyObject* PyUnicode_DecodeLocale(const char *str, const char *errors)

   Similar to :c:func:`PyUnicode_DecodeLocaleAndSize`, but compute the string
   length using :c:func:`strlen`.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_EncodeLocale(PyObject *unicode, const char *errors)

   Encode a Unicode object to UTF-8 on Android and VxWorks, or to the current
   locale encoding on other platforms. The
   supported error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The encoder uses the ``"strict"`` error handler if
   *errors* is ``NULL``. Return a :class:`bytes` object. *unicode* cannot
   contain embedded null characters.

   Use :c:func:`PyUnicode_EncodeFSDefault` to encode a string to
   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
   Python startup).

   This function ignores the :ref:`Python UTF-8 Mode <utf8-mode>`.

   .. seealso::

      The :c:func:`Py_EncodeLocale` function.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously,
      :c:func:`Py_EncodeLocale`
      was used for the ``surrogateescape``, and the current locale encoding was
      used for ``strict``.


File System Encoding
""""""""""""""""""""

To encode and decode file names and other environment strings,
:c:data:`Py_FileSystemDefaultEncoding` should be used as the encoding, and
:c:data:`Py_FileSystemDefaultEncodeErrors` should be used as the error handler
(:pep:`383` and :pep:`529`).
To encode file names to :class:`bytes` during
argument parsing, the ``"O&"`` converter should be used, passing
:c:func:`PyUnicode_FSConverter` as the conversion function:

.. c:function:: int PyUnicode_FSConverter(PyObject* obj, void* result)

   ParseTuple converter: encode :class:`str` objects -- obtained directly or
   through the :class:`os.PathLike` interface -- to :class:`bytes` using
   :c:func:`PyUnicode_EncodeFSDefault`; :class:`bytes` objects are output as-is.
   *result* must be a :c:type:`PyBytesObject*` which must be released when it is
   no longer used.

   .. versionadded:: 3.1

   .. versionchanged:: 3.6
      Accepts a :term:`path-like object`.

To decode file names to :class:`str` during argument parsing, the ``"O&"``
converter should be used, passing :c:func:`PyUnicode_FSDecoder` as the
conversion function:

.. c:function:: int PyUnicode_FSDecoder(PyObject* obj, void* result)

   ParseTuple converter: decode :class:`bytes` objects -- obtained either
   directly or indirectly through the :class:`os.PathLike` interface -- to
   :class:`str` using :c:func:`PyUnicode_DecodeFSDefaultAndSize`; :class:`str`
   objects are output as-is. *result* must be a :c:type:`PyUnicodeObject*` which
   must be released when it is no longer used.

   .. versionadded:: 3.2

   .. versionchanged:: 3.6
      Accepts a :term:`path-like object`.


.. c:function:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)

   Decode a string from the :term:`filesystem encoding and error handler`.

   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
   locale encoding.

   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
   locale encoding and cannot be modified later. If you need to decode a string
   from the current locale encoding, use
   :c:func:`PyUnicode_DecodeLocaleAndSize`.

   .. seealso::

      The :c:func:`Py_DecodeLocale` function.

   .. versionchanged:: 3.6
      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.


.. c:function:: PyObject* PyUnicode_DecodeFSDefault(const char *s)

   Decode a null-terminated string from the :term:`filesystem encoding and
   error handler`.

   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
   locale encoding.

   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length.

   .. versionchanged:: 3.6
      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.


.. c:function:: PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode)

   Encode a Unicode object to :c:data:`Py_FileSystemDefaultEncoding` with the
   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler, and return
   :class:`bytes`. Note that the resulting :class:`bytes` object may contain
   null bytes.

   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
   locale encoding.

   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
   locale encoding and cannot be modified later. If you need to encode a string
   to the current locale encoding, use :c:func:`PyUnicode_EncodeLocale`.

   .. seealso::

      The :c:func:`Py_EncodeLocale` function.

   .. versionadded:: 3.2

   .. versionchanged:: 3.6
      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.

wchar_t Support
"""""""""""""""

:c:type:`wchar_t` support for platforms which support it:

.. c:function:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :c:type:`wchar_t` buffer *w* of the given *size*.
   Passing ``-1`` as the *size* indicates that the function must itself compute the length,
   using ``wcslen``.
   Return ``NULL`` on failure.


.. c:function:: Py_ssize_t PyUnicode_AsWideChar(PyObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*.  At most
   *size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing
   null termination character).  Return the number of :c:type:`wchar_t` characters
   copied or ``-1`` in case of an error.  Note that the resulting :c:type:`wchar_t*`
   string may or may not be null-terminated.  It is the responsibility of the caller
   to make sure that the :c:type:`wchar_t*` string is null-terminated in case this is
   required by the application. Also, note that the :c:type:`wchar_t*` string
   might contain null characters, which would cause the string to be truncated
   when used with most C functions.


.. c:function:: wchar_t* PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)

   Convert the Unicode object to a wide character string. The output string
   always ends with a null character. If *size* is not ``NULL``, write the number
   of wide characters (excluding the trailing null termination character) into
   *\*size*. Note that the resulting :c:type:`wchar_t` string might contain
   null characters, which would cause the string to be truncated when used with
   most C functions. If *size* is ``NULL`` and the :c:type:`wchar_t*` string
   contains null characters, a :exc:`ValueError` is raised.

   Returns a buffer allocated by :c:func:`PyMem_Alloc` (use
   :c:func:`PyMem_Free` to free it) on success. On error, returns ``NULL``
   and *\*size* is undefined. Raises a :exc:`MemoryError` if memory allocation
   fails.

   .. versionadded:: 3.2

   .. versionchanged:: 3.7
      Raises a :exc:`ValueError` if *size* is ``NULL`` and the :c:type:`wchar_t*`
      string contains null characters.


.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*, and they
have the same semantics as the ones of the built-in :func:`str` string object
constructor.

Setting *encoding* to ``NULL`` causes the default encoding, UTF-8, to be used.
The file system calls should use
:c:func:`PyUnicode_FSConverter` for encoding file names. This uses the
variable :c:data:`Py_FileSystemDefaultEncoding` internally. This
variable should be treated as read-only: on some systems, it will be a
pointer to a static string, on others, it will change at run-time
(such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to ``NULL``, meaning to use
the default handling defined for the codec.  Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface.  For simplicity, only deviations from
the following generic ones are documented.


Generic Codecs
""""""""""""""

These are the generic codec APIs:


.. c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, \
                              const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`str` built-in function.  The codec to be used is looked up
   using the Python codec registry.  Return ``NULL`` if an exception was raised by
   the codec.


.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, \
                              const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python bytes object.
   *encoding* and *errors* have the same meaning as the parameters of the same
   name in the Unicode :meth:`~str.encode` method.  The codec to be used is looked up
   using the Python codec registry. Return ``NULL`` if an exception was raised by
   the codec.


.. c:function:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, \
                              const char *encoding, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* and return a Python
   bytes object.  *encoding* and *errors* have the same meaning as the
   parameters of the same name in the Unicode :meth:`~str.encode` method.  The codec
   to be used is looked up using the Python codec registry.  Return ``NULL`` if an
   exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsEncodedString`.


UTF-8 Codecs
""""""""""""

These are the UTF-8 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF8`. If
   *consumed* is not ``NULL``, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: const char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)

   Return a pointer to the UTF-8 encoding of the Unicode object, and
   store the size of the encoded representation (in bytes) in *size*.  The
   *size* argument can be ``NULL``; in this case no size will be stored.  The
   returned buffer always has an extra null byte appended (not included in
   *size*), regardless of whether there are any other null code points.

   In the case of an error, ``NULL`` is returned with an exception set and no
   *size* is stored.

   This caches the UTF-8 representation of the string in the Unicode object, and
   subsequent calls will return a pointer to the same buffer.  The caller is not
   responsible for deallocating the buffer.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.

   .. versionchanged:: 3.10
      This function is a part of the :ref:`limited API <stable>`.


.. c:function:: const char* PyUnicode_AsUTF8(PyObject *unicode)

   As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.


.. c:function:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-8 and
   return a Python bytes object.  Return ``NULL`` if an exception was raised by
   the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUTF8String`, :c:func:`PyUnicode_AsUTF8AndSize` or
      :c:func:`PyUnicode_AsEncodedString`.


UTF-32 Codecs
"""""""""""""

These are the UTF-32 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object.  *errors* (if non-``NULL``) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first four bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output.

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   If *byteorder* is ``NULL``, the codec starts in native order mode.

   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF32`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.
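
   As an illustration of the *byteorder* and *consumed* semantics, the
   following minimal sketch decodes one chunk of a UTF-32 stream. The helper
   name ``decode_utf32_chunk`` is hypothetical and not part of the C API, and
   the sketch assumes an already initialized interpreter:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper (not part of the C API): decode as much of a
         UTF-32 chunk as possible.  On success, *consumed holds the number of
         bytes that were decoded and *byteorder holds the byte order in
         effect at the end of the input. */
      static PyObject *
      decode_utf32_chunk(const char *buf, Py_ssize_t len,
                         int *byteorder, Py_ssize_t *consumed)
      {
          /* errors == NULL selects the default "strict" error handling. */
          return PyUnicode_DecodeUTF32Stateful(buf, len, NULL,
                                               byteorder, consumed);
      }

   For example, passing the ten bytes ``FF FE 00 00 41 00 00 00 42 00`` with
   ``*byteorder`` set to ``0`` should return the string ``"A"``, leave
   ``*byteorder`` at ``-1`` (little endian, taken from the BOM) and store
   ``8`` in ``*consumed``. A caller would typically keep the two undecoded
   bytes and prepend them to the next chunk.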


.. c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python byte string using the UTF-32 encoding in native byte
   order. The string always starts with a BOM mark.  Error handling is "strict".
   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, \
                              const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*.  Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If byteorder is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If ``Py_UNICODE_WIDE`` is not defined, surrogate pairs will be output
   as a single code point.

   Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUTF32String` or :c:func:`PyUnicode_AsEncodedString`.


UTF-16 Codecs
"""""""""""""

These are the UTF-16 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object.  *errors* (if non-``NULL``) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first two bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output (where it will result in
   either a ``\ufeff`` or a ``\ufffe`` character).

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   If *byteorder* is ``NULL``, the codec starts in native order mode.

   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF16`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python byte string using the UTF-16 encoding in native byte
   order. The string always starts with a BOM mark.  Error handling is "strict".
   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, \
                              const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-16 encoded value of the Unicode
   data in *s*.  Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If byteorder is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If ``Py_UNICODE_WIDE`` is defined, a single :c:type:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :c:type:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUTF16String` or :c:func:`PyUnicode_AsEncodedString`.


UTF-7 Codecs
""""""""""""

These are the UTF-7 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
   *s*.  Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF7`.  If
   *consumed* is not ``NULL``, trailing incomplete UTF-7 base-64 sections will not
   be treated as an error.  Those bytes will not be decoded and the number of
   bytes that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, \
                              int base64SetO, int base64WhiteSpace, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given size using UTF-7 and
   return a Python bytes object.  Return ``NULL`` if an exception was raised by
   the codec.

   If *base64SetO* is nonzero, "Set O" (punctuation that has no otherwise
   special meaning) will be encoded in base-64.  If *base64WhiteSpace* is
   nonzero, whitespace will be encoded in base-64.  Both are set to zero for the
   Python "utf-7" codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsEncodedString`.


Unicode-Escape Codecs
"""""""""""""""""""""

These are the "Unicode Escape" codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, \
                              Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*.  Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as a
   bytes object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Unicode-Escape and
   return a bytes object.  Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUnicodeEscapeString`.


Raw-Unicode-Escape Codecs
"""""""""""""""""""""""""

These are the "Raw Unicode Escape" codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, \
                              Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*.  Return ``NULL`` if an exception was raised by the codec.
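
   The difference from the Unicode-Escape codec is that Raw-Unicode-Escape
   only interprets ``\uXXXX`` and ``\UXXXXXXXX`` sequences and leaves every
   other backslash in place. The following minimal sketch makes this visible
   by comparing decoded lengths; ``decoded_length`` is a hypothetical helper
   name, and an initialized interpreter is assumed:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>
      #include <string.h>

      /* Hypothetical helper (not part of the C API): decode *text* with
         Raw-Unicode-Escape (raw != 0) or Unicode-Escape (raw == 0) and
         return the number of code points, or -1 with an exception set. */
      static Py_ssize_t
      decoded_length(const char *text, int raw)
      {
          Py_ssize_t size = (Py_ssize_t)strlen(text);
          PyObject *u = raw
              ? PyUnicode_DecodeRawUnicodeEscape(text, size, NULL)
              : PyUnicode_DecodeUnicodeEscape(text, size, NULL);
          if (u == NULL) {
              return -1;
          }
          Py_ssize_t n = PyUnicode_GetLength(u);
          Py_DECREF(u);
          return n;
      }

   For the eight input bytes ``\u00e9\n`` (written ``"\\u00e9\\n"`` in C
   source), the Unicode-Escape decoder yields two code points (U+00E9 and a
   newline), while the Raw-Unicode-Escape decoder yields three (U+00E9, a
   backslash and the letter ``n``).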


.. c:function:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as
   a bytes object.  Error handling is "strict".  Return ``NULL`` if an exception
   was raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, \
                              Py_ssize_t size)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Raw-Unicode-Escape
   and return a bytes object.  Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsRawUnicodeEscapeString` or
      :c:func:`PyUnicode_AsEncodedString`.


Latin-1 Codecs
""""""""""""""

These are the Latin-1 codec APIs: Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.


.. c:function:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*.  Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Latin-1 and
   return a Python bytes object.  Return ``NULL`` if an exception was raised by
   the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsLatin1String` or
      :c:func:`PyUnicode_AsEncodedString`.


ASCII Codecs
""""""""""""

These are the ASCII codec APIs.  Only 7-bit ASCII data is accepted. All other
codes generate errors.


.. c:function:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*.  Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using ASCII and
   return a Python bytes object.  Return ``NULL`` if an exception was raised by
   the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsASCIIString` or
      :c:func:`PyUnicode_AsEncodedString`.


Character Map Codecs
""""""""""""""""""""

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.  The mapping objects provided must support the
:meth:`__getitem__` mapping interface; dictionaries and sequences work well.

These are the mapping codec APIs:

.. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *data, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *data*
   using the given *mapping* object.  Return ``NULL`` if an exception was raised
   by the codec.

   If *mapping* is ``NULL``, Latin-1 decoding will be applied.  Else
   *mapping* must map bytes ordinals (integers in the range from 0 to 255)
   to Unicode strings, integers (which are then interpreted as Unicode
   ordinals) or ``None``.  Unmapped data bytes -- ones which cause a
   :exc:`LookupError`, as well as ones which get mapped to ``None``,
   ``0xFFFE`` or ``'\ufffe'`` -- are treated as undefined mappings and cause
   an error.


.. c:function:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the
   result as a bytes object.  Error handling is "strict".  Return ``NULL`` if an
   exception was raised by the codec.

   The *mapping* object must map Unicode ordinal integers to bytes objects,
   integers in the range from 0 to 255 or ``None``.  Unmapped character
   ordinals (ones which cause a :exc:`LookupError`) as well as ones mapped to
   ``None`` are treated as "undefined mapping" and cause an error.


.. c:function:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using the given
   *mapping* object and return the result as a bytes object.  Return ``NULL`` if
   an exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsCharmapString` or
      :c:func:`PyUnicode_AsEncodedString`.
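
As a rough illustration of the mapping codec APIs, the sketch below builds two
small dictionaries with :c:func:`Py_BuildValue` and uses them to decode and
encode a toy two-character encoding. The helper names ``decode_greek`` and
``encode_greek`` and the mapping itself are invented for this example, and an
initialized interpreter is assumed:

.. code-block:: c

   #define PY_SSIZE_T_CLEAN
   #include <Python.h>

   /* Hypothetical helper: decode the bytes 'A' and 'B' to GREEK SMALL
      LETTER ALPHA and BETA.  Every other byte is unmapped, so it raises
      an error under the default "strict" handling. */
   static PyObject *
   decode_greek(const char *data, Py_ssize_t size)
   {
       /* Byte ordinal -> Unicode ordinal. */
       PyObject *mapping = Py_BuildValue("{i:i,i:i}",
                                         0x41, 0x3B1, 0x42, 0x3B2);
       if (mapping == NULL) {
           return NULL;
       }
       PyObject *result = PyUnicode_DecodeCharmap(data, size, mapping, NULL);
       Py_DECREF(mapping);
       return result;
   }

   /* Hypothetical helper: the inverse mapping, returning a bytes object. */
   static PyObject *
   encode_greek(PyObject *unicode)
   {
       /* Unicode ordinal -> integer in the range 0 to 255. */
       PyObject *mapping = Py_BuildValue("{i:i,i:i}",
                                         0x3B1, 0x41, 0x3B2, 0x42);
       if (mapping == NULL) {
           return NULL;
       }
       PyObject *result = PyUnicode_AsCharmapString(unicode, mapping);
       Py_DECREF(mapping);
       return result;
   }

With these helpers, ``decode_greek("AB", 2)`` should produce a two-character
string (U+03B1, U+03B2) that ``encode_greek`` turns back into the bytes
``AB``, whereas ``decode_greek("AC", 2)`` should fail because the byte ``C``
has no entry in the mapping.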


The following codec API is special in that it maps Unicode to Unicode.

.. c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object. Return ``NULL`` if an exception was raised by the
   codec.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well.  Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be ``NULL`` which indicates to
   use the default error handling.


.. c:function:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Translate a :c:type:`Py_UNICODE` buffer of the given *size* by applying a
   character *mapping* table to it and return the resulting Unicode object.
   Return ``NULL`` when an exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_Translate` or the :ref:`generic codec based API
      <codec-registry>`.


MBCS codecs for Windows
"""""""""""""""""""""""

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions.  Note that MBCS (or
DBCS) is a class of encodings, not just one.  The target encoding is defined by
the user settings on the machine running the codec.

.. c:function:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeMBCS`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte and the number of bytes that have been decoded will be stored
   in *consumed*.


.. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeCodePage(int code_page, PyObject *unicode, const char *errors)

   Encode the Unicode object using the specified code page and return a Python
   bytes object.  Return ``NULL`` if an exception was raised by the codec. Use
   :c:data:`CP_ACP` code page to get the MBCS encoder.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using MBCS and return
   a Python bytes object.  Return ``NULL`` if an exception was raised by the
   codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsMBCSString`, :c:func:`PyUnicode_EncodeCodePage` or
      :c:func:`PyUnicode_AsEncodedString`.


Methods & Slots
"""""""""""""""


.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return ``NULL`` or ``-1`` if an exception occurs.


.. c:function:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. c:function:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string giving a list of Unicode strings.  If *sep* is ``NULL``, splitting
   will be done at all whitespace substrings.  Otherwise, splits occur at the given
   separator.  At most *maxsplit* splits will be done.  If negative, no limit is
   set.  Separators are not included in the resulting list.


.. c:function:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break.  If *keepend* is ``0``, the line break
   characters are not included in the resulting strings.


.. c:function:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given *separator* and return the resulting
   Unicode string.


.. c:function:: Py_ssize_t PyUnicode_Tailmatch(PyObject *str, PyObject *substr, \
                              Py_ssize_t start, Py_ssize_t end, int direction)

   Return ``1`` if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == ``-1`` means to do a prefix match, *direction* == ``1`` a suffix match),
   ``0`` otherwise. Return ``-1`` if an error occurred.


.. c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, \
                              Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == ``1`` means to do a forward search, *direction* == ``-1`` a
   backward search).  The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.


.. c:function:: Py_ssize_t PyUnicode_FindChar(PyObject *str, Py_UCS4 ch, \
                              Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of the character *ch* in ``str[start:end]`` using
   the given *direction* (*direction* == ``1`` means to do a forward search,
   *direction* == ``-1`` a backward search).  The return value is the index of the
   first match; a value of ``-1`` indicates that no match was found, and ``-2``
   indicates that an error occurred and an exception has been set.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      *start* and *end* are now adjusted to behave like ``str[start:end]``.


.. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, \
                              Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``.  Return ``-1`` if an error occurred.


.. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, \
                              PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == ``-1`` means replace all
   occurrences.


.. c:function:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return ``-1``, ``0``, ``1`` for less than, equal, and greater than,
   respectively.

   This function returns ``-1`` upon failure, so one should call
   :c:func:`PyErr_Occurred` to check for errors.


.. c:function:: int PyUnicode_CompareWithASCIIString(PyObject *uni, const char *string)

   Compare a Unicode object, *uni*, with *string* and return ``-1``, ``0``, ``1`` for less
   than, equal, and greater than, respectively. It is best to pass only
   ASCII-encoded strings, but the function interprets the input string as
   ISO-8859-1 if it contains non-ASCII characters.

   This function does not raise exceptions.


.. c:function:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. c:function:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``.


.. c:function:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one element Unicode string. ``-1`` is returned
   if there was an error.


.. c:function:: void PyUnicode_InternInPlace(PyObject **string)

   Intern the argument *\*string* in place.  The argument must be the address of a
   pointer variable pointing to a Python Unicode string object.
   If there is an
   existing interned string that is the same as *\*string*, it sets *\*string* to
   it (decrementing the reference count of the old string object and incrementing
   the reference count of the interned string object), otherwise it leaves
   *\*string* alone and interns it (incrementing its reference count).
   (Clarification: even though there is a lot of talk about reference counts, think
   of this function as reference-count-neutral; you own the object after the call
   if and only if you owned it before the call.)


.. c:function:: PyObject* PyUnicode_InternFromString(const char *v)

   A combination of :c:func:`PyUnicode_FromString` and
   :c:func:`PyUnicode_InternInPlace`, returning either a new Unicode string
   object that has been interned, or a new ("owned") reference to an earlier
   interned string object with the same value.