.. highlight:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Georg Brandl <georg@python.org>

Unicode Objects
^^^^^^^^^^^^^^^

Since the implementation of :pep:`393` in Python 3.3, Unicode objects internally
use a variety of representations, in order to allow handling the complete range
of Unicode characters while staying memory efficient.  There are special cases
for strings where all code points are below 128, 256, or 65536; otherwise, code
points must be below 1114112 (which is the full Unicode range).

:c:type:`Py_UNICODE*` and UTF-8 representations are created on demand and cached
in the Unicode object.  The :c:type:`Py_UNICODE*` representation is deprecated
and inefficient; it should be avoided in performance- or memory-sensitive
situations.

Due to the transition between the old APIs and the new APIs, Unicode objects
can internally be in two states depending on how they were created:

* "canonical" Unicode objects are all objects created by a non-deprecated
  Unicode API.  They use the most efficient representation allowed by the
  implementation.

* "legacy" Unicode objects have been created through one of the deprecated
  APIs (typically :c:func:`PyUnicode_FromUnicode`) and only bear the
  :c:type:`Py_UNICODE*` representation; you will have to call
  :c:func:`PyUnicode_READY` on them before calling any other API.

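The special cases mentioned above (all code points below 128, 256, or 65536)
amount to choosing how many bytes each character occupies.  The following
stand-alone sketch models that choice in plain C.  ``storage_width`` is a
hypothetical helper written for illustration; it is not part of the Python C
API and does not reproduce the interpreter's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not a CPython function: return the number of bytes
   per character needed for a string whose largest code point is
   max_code_point. */
int
storage_width(uint32_t max_code_point)
{
    assert(max_code_point < 1114112);   /* must lie in the Unicode range */
    if (max_code_point < 256) {
        return 1;   /* below 128 and below 256 both fit in one byte */
    }
    if (max_code_point < 65536) {
        return 2;
    }
    return 4;
}
```

A string containing only ``U+0041`` would need one byte per character under
this model, while a single ``U+1F600`` anywhere in the string would raise the
width to four bytes for every character.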

Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:

.. c:type:: Py_UCS4
            Py_UCS2
            Py_UCS1

   These types are typedefs for unsigned integer types wide enough to contain
   characters of 32 bits, 16 bits and 8 bits, respectively.  When dealing with
   single Unicode characters, use :c:type:`Py_UCS4`.

   .. versionadded:: 3.3


.. c:type:: Py_UNICODE

   This is a typedef of :c:type:`wchar_t`, which is a 16-bit type or 32-bit type
   depending on the platform.

   .. versionchanged:: 3.3
      In previous versions, this was a 16-bit type or a 32-bit type depending on
      whether you selected a "narrow" or "wide" Unicode version of Python at
      build time.


.. c:type:: PyASCIIObject
            PyCompactUnicodeObject
            PyUnicodeObject

   These subtypes of :c:type:`PyObject` represent a Python Unicode object.  In
   almost all cases, they shouldn't be used directly, since all API functions
   that deal with Unicode objects take and return :c:type:`PyObject` pointers.

   .. versionadded:: 3.3


.. c:var:: PyTypeObject PyUnicode_Type

   This instance of :c:type:`PyTypeObject` represents the Python Unicode type.  It
   is exposed to Python code as ``str``.


The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:

.. c:function:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.


.. c:function:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.


.. c:function:: int PyUnicode_READY(PyObject *o)

   Ensure the string object *o* is in the "canonical" representation.  This is
   required before using any of the access macros described below.

   .. XXX expand on when it is not required

   Returns ``0`` on success and ``-1`` with an exception set on failure, which in
   particular happens if memory allocation fails.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_GET_LENGTH(PyObject *o)

   Return the length of the Unicode string, in code points.  *o* has to be a
   Unicode object in the "canonical" representation (not checked).

   .. versionadded:: 3.3


.. c:function:: Py_UCS1* PyUnicode_1BYTE_DATA(PyObject *o)
                Py_UCS2* PyUnicode_2BYTE_DATA(PyObject *o)
                Py_UCS4* PyUnicode_4BYTE_DATA(PyObject *o)

   Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4
   integer types for direct character access.  No check is performed to verify
   that the canonical representation has the matching character size; use
   :c:func:`PyUnicode_KIND` to select the right macro.  Make sure
   :c:func:`PyUnicode_READY` has been called before accessing this.

   .. versionadded:: 3.3


.. c:macro:: PyUnicode_WCHAR_KIND
             PyUnicode_1BYTE_KIND
             PyUnicode_2BYTE_KIND
             PyUnicode_4BYTE_KIND

   Return values of the :c:func:`PyUnicode_KIND` macro.

   .. versionadded:: 3.3


.. c:function:: int PyUnicode_KIND(PyObject *o)

   Return one of the PyUnicode kind constants (see above) that indicate how many
   bytes per character this Unicode object uses to store its data.  *o* has to
   be a Unicode object in the "canonical" representation (not checked).

   .. XXX document "0" return value?

   .. versionadded:: 3.3


.. c:function:: void* PyUnicode_DATA(PyObject *o)

   Return a void pointer to the raw Unicode buffer.  *o* has to be a Unicode
   object in the "canonical" representation (not checked).

   .. versionadded:: 3.3


.. c:function:: void PyUnicode_WRITE(int kind, void *data, Py_ssize_t index, \
                                     Py_UCS4 value)

   Write into a canonical representation *data* (as obtained with
   :c:func:`PyUnicode_DATA`).  This macro does not do any sanity checks and is
   intended for usage in loops.  The caller should cache the *kind* value and
   *data* pointer as obtained from other macro calls.  *index* is the index in
   the string (starts at 0) and *value* is the new code point value which should
   be written to that location.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_READ(int kind, void *data, Py_ssize_t index)

   Read a code point from a canonical representation *data* (as obtained with
   :c:func:`PyUnicode_DATA`).  No checks or ready calls are performed.

   .. versionadded:: 3.3

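The *kind*/*data* pattern used by :c:func:`PyUnicode_WRITE` and
:c:func:`PyUnicode_READ` can be pictured as a switch on the character width
followed by a typed array access.  The sketch below is a plain C model of that
idea, not the macros' real definitions: ``read_char``, ``write_char`` and the
``KIND_*`` constants are hypothetical names, and the constants are assumed to
equal the number of bytes per character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kind constants; each value is assumed to
   be the number of bytes per character. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Read the code point stored at index in a buffer of the given kind. */
uint32_t
read_char(int kind, const void *data, size_t index)
{
    assert(data != NULL);
    switch (kind) {
    case KIND_1BYTE: return ((const uint8_t *)data)[index];
    case KIND_2BYTE: return ((const uint16_t *)data)[index];
    default:         return ((const uint32_t *)data)[index];
    }
}

/* Store value at index; the caller must ensure that value fits the kind
   and that index is in bounds, since nothing is checked here. */
void
write_char(int kind, void *data, size_t index, uint32_t value)
{
    assert(data != NULL);
    switch (kind) {
    case KIND_1BYTE: ((uint8_t *)data)[index] = (uint8_t)value; break;
    case KIND_2BYTE: ((uint16_t *)data)[index] = (uint16_t)value; break;
    default:         ((uint32_t *)data)[index] = value; break;
    }
}
```

Because the kind and data pointer are passed in rather than looked up, a loop
can fetch them once and reuse them for every index, which is the usage the
descriptions above recommend.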

.. c:function:: Py_UCS4 PyUnicode_READ_CHAR(PyObject *o, Py_ssize_t index)

   Read a character from a Unicode object *o*, which must be in the "canonical"
   representation.  This is less efficient than :c:func:`PyUnicode_READ` if you
   do multiple consecutive reads.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_MAX_CHAR_VALUE(PyObject *o)

   Return the maximum code point that is suitable for creating another string
   based on *o*, which must be in the "canonical" representation.  This is
   always an approximation but more efficient than iterating over the string.

   .. versionadded:: 3.3

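One way to picture why the result of :c:func:`PyUnicode_MAX_CHAR_VALUE` is an
approximation: if the value is derived from the storage width alone (an
assumption of this sketch), it can only be the upper bound of that width's
range, not the largest code point actually present.  ``approx_max_char`` below
is a hypothetical plain C model, not the macro's definition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: bytes_per_char is 1, 2 or 4; is_ascii is nonzero when
   every code point in the string is known to be below 128. */
uint32_t
approx_max_char(int bytes_per_char, int is_ascii)
{
    assert(bytes_per_char == 1 || bytes_per_char == 2 || bytes_per_char == 4);
    if (is_ascii) {
        return 0x7F;        /* 127 */
    }
    if (bytes_per_char == 1) {
        return 0xFF;        /* 255 */
    }
    if (bytes_per_char == 2) {
        return 0xFFFF;      /* 65535 */
    }
    return 0x10FFFF;        /* 1114111 */
}
```

Under this model a two-byte string whose largest character is ``U+0100``
still reports 65535, which is safe to use as the *maxchar* of a new string but
may be larger than strictly necessary.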

.. c:function:: int PyUnicode_ClearFreeList()

   Clear the free list. Return the total number of freed items.


.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
   code units (this includes surrogate pairs as 2 units).  *o* has to be a
   Unicode object (not checked).

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.

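The difference between code units and code points matters when
:c:type:`Py_UNICODE` is a 16-bit type: a code point above ``0xFFFF`` is then
stored as a surrogate pair and counts as two units.  A minimal sketch of that
count, where ``utf16_code_units`` is a hypothetical helper and the input is
assumed to be an array of code points:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of 16-bit code units needed to store the
   given code points, counting each surrogate pair as 2 units. */
size_t
utf16_code_units(const uint32_t *code_points, size_t length)
{
    size_t units = 0;
    assert(code_points != NULL || length == 0);
    for (size_t i = 0; i < length; i++) {
        units += (code_points[i] > 0xFFFF) ? 2 : 1;
    }
    return units;
}
```

For the three code points ``U+0041``, ``U+20AC`` and ``U+1F600`` this yields
four units, whereas the length in code points is three.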

.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation in
   bytes.  *o* has to be a Unicode object (not checked).

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
                const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to a :c:type:`Py_UNICODE` representation of the object.  The
   returned buffer is always terminated with an extra null code point.  It
   may also contain embedded null code points, which would cause the string
   to be truncated when used in most C functions.  The ``AS_DATA`` form
   casts the pointer to :c:type:`const char *`.  The *o* argument has to be
   a Unicode object (not checked).

   .. versionchanged:: 3.3
      This macro is now inefficient -- because in many cases the
      :c:type:`Py_UNICODE` representation does not exist and needs to be created
      -- and can fail (return ``NULL`` with an exception set).  Try to port the
      code to use the new :c:func:`PyUnicode_nBYTE_DATA` macros or use
      :c:func:`PyUnicode_WRITE` or :c:func:`PyUnicode_READ`.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style Unicode API, please migrate to using the
      :c:func:`PyUnicode_nBYTE_DATA` family of macros.


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. c:function:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a whitespace character.


.. c:function:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a lowercase character.


.. c:function:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is an uppercase character.


.. c:function:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a titlecase character.


.. c:function:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a linebreak character.


.. c:function:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a decimal character.


.. c:function:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a digit character.


.. c:function:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a numeric character.


.. c:function:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphabetic character.


.. c:function:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphanumeric character.


.. c:function:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a printable character.
   Nonprintable characters are those characters defined in the Unicode character
   database as "Other" or "Separator", excepting the ASCII space (0x20) which is
   considered printable.  (Note that printable characters in this context are
   those which should not be escaped when :func:`repr` is invoked on a string.
   It has no bearing on the handling of strings written to :data:`sys.stdout` or
   :data:`sys.stderr`.)


These APIs can be used for fast direct character conversions:


.. c:function:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer.  Return
   ``-1`` if this is not possible.  This macro does not raise exceptions.


.. c:function:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible.  This macro does not raise exceptions.


.. c:function:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible.  This macro does not raise exceptions.


These APIs can be used to work with surrogates:

.. c:macro:: Py_UNICODE_IS_SURROGATE(ch)

   Check if *ch* is a surrogate (``0xD800 <= ch <= 0xDFFF``).

.. c:macro:: Py_UNICODE_IS_HIGH_SURROGATE(ch)

   Check if *ch* is a high surrogate (``0xD800 <= ch <= 0xDBFF``).

.. c:macro:: Py_UNICODE_IS_LOW_SURROGATE(ch)

   Check if *ch* is a low surrogate (``0xDC00 <= ch <= 0xDFFF``).

.. c:macro:: Py_UNICODE_JOIN_SURROGATES(high, low)

   Join two surrogate characters and return a single Py_UCS4 value.
   *high* and *low* are respectively the leading and trailing surrogates in a
   surrogate pair.

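The arithmetic behind joining a surrogate pair follows from the UTF-16
encoding: each surrogate carries 10 payload bits, and the combined 20-bit
value is offset by ``0x10000``.  The sketch below restates the range checks
given above and that computation in plain C.  The function names are
hypothetical, and the code is a model of the technique rather than the macros'
actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Range checks, using the bounds listed for the macros above. */
int
is_high_surrogate(uint32_t ch)
{
    return 0xD800 <= ch && ch <= 0xDBFF;
}

int
is_low_surrogate(uint32_t ch)
{
    return 0xDC00 <= ch && ch <= 0xDFFF;
}

/* Combine a leading (high) and trailing (low) surrogate into one code
   point, following the UTF-16 definition. */
uint32_t
join_surrogates(uint32_t high, uint32_t low)
{
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return 0x10000 + (((high & 0x03FF) << 10) | (low & 0x03FF));
}
```

For example, the pair ``0xD83D``, ``0xDE00`` joins to ``0x1F600``.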

Creating and accessing Unicode strings
""""""""""""""""""""""""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:

.. c:function:: PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)

   Create a new Unicode object.  *maxchar* should be the true maximum code point
   to be placed in the string.  As an approximation, it can be rounded up to the
   nearest value in the sequence 127, 255, 65535, 1114111.

   This is the recommended way to allocate a new Unicode object.  Objects
   created using this function are not resizable.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_FromKindAndData(int kind, const void *buffer, \
                                                    Py_ssize_t size)

   Create a new Unicode object with the given *kind* (possible values are
   :c:macro:`PyUnicode_1BYTE_KIND` etc., as returned by
   :c:func:`PyUnicode_KIND`).  The *buffer* must point to an array of *size*
   units of 1, 2 or 4 bytes per character, as given by the kind.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*.  The bytes will be
   interpreted as being UTF-8 encoded.  The buffer is copied into the new
   object. If the buffer is not ``NULL``, the return value might be a shared
   object, i.e. modification of the data is not allowed.

   If *u* is ``NULL``, this function behaves like :c:func:`PyUnicode_FromUnicode`
   with the buffer set to ``NULL``.  This usage is deprecated in favor of
   :c:func:`PyUnicode_New`.


.. c:function:: PyObject *PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.


.. c:function:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :c:func:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python Unicode string and return
   a string with the values formatted into it.  The variable arguments must be C
   types and must correspond exactly to the format characters in the
   ASCII-encoded *format* string.  The following format characters are allowed:

   .. % This should be exactly the same as the table in PyErr_Format.
   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.
   .. % Similar comments apply to the %ll width modifier and

   .. tabularcolumns:: |l|l|L|

   +-------------------+---------------------+----------------------------------+
   | Format Characters | Type                | Comment                          |
   +===================+=====================+==================================+
   | :attr:`%%`        | *n/a*               | The literal % character.         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%c`        | int                 | A single character,              |
   |                   |                     | represented as a C int.          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%d`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%d")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%u`        | unsigned int        | Equivalent to                    |
   |                   |                     | ``printf("%u")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%ld`       | long                | Equivalent to                    |
   |                   |                     | ``printf("%ld")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%li`       | long                | Equivalent to                    |
   |                   |                     | ``printf("%li")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lu`       | unsigned long       | Equivalent to                    |
   |                   |                     | ``printf("%lu")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lld`      | long long           | Equivalent to                    |
   |                   |                     | ``printf("%lld")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lli`      | long long           | Equivalent to                    |
   |                   |                     | ``printf("%lli")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%llu`      | unsigned long long  | Equivalent to                    |
   |                   |                     | ``printf("%llu")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Equivalent to                    |
   |                   |                     | ``printf("%zd")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zi`       | Py_ssize_t          | Equivalent to                    |
   |                   |                     | ``printf("%zi")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zu`       | size_t              | Equivalent to                    |
   |                   |                     | ``printf("%zu")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%i`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%i")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%x`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%x")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%s`        | const char\*        | A null-terminated C character    |
   |                   |                     | array.                           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%p`        | const void\*        | The hex representation of a C    |
   |                   |                     | pointer. Mostly equivalent to    |
   |                   |                     | ``printf("%p")`` except that     |
   |                   |                     | it is guaranteed to start with   |
   |                   |                     | the literal ``0x`` regardless    |
   |                   |                     | of what the platform's           |
   |                   |                     | ``printf`` yields.               |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%A`        | PyObject\*          | The result of calling            |
   |                   |                     | :func:`ascii`.                   |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%U`        | PyObject\*          | A Unicode object.                |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%V`        | PyObject\*,         | A Unicode object (which may be   |
   |                   | const char\*        | ``NULL``) and a null-terminated  |
   |                   |                     | C character array as a second    |
   |                   |                     | parameter (which will be used,   |
   |                   |                     | if the first parameter is        |
   |                   |                     | ``NULL``).                       |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling            |
   |                   |                     | :c:func:`PyObject_Str`.          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling            |
   |                   |                     | :c:func:`PyObject_Repr`.         |
   +-------------------+---------------------+----------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.

   .. note::
      The width formatter unit is number of characters rather than bytes.
      The precision formatter unit is number of bytes for ``"%s"`` and
      ``"%V"`` (if the ``PyObject*`` argument is ``NULL``), and a number of
      characters for ``"%A"``, ``"%U"``, ``"%S"``, ``"%R"`` and ``"%V"``
      (if the ``PyObject*`` argument is not ``NULL``).

   .. [1] For integer specifiers (d, u, ld, li, lu, lld, lli, llu, zd, zi,
      zu, i, x): the 0-conversion flag has effect even when a precision is given.

   .. versionchanged:: 3.2
      Support for ``"%lld"`` and ``"%llu"`` added.

   .. versionchanged:: 3.3
      Support for ``"%li"``, ``"%lli"`` and ``"%zi"`` added.

   .. versionchanged:: 3.4
      Support width and precision formatter for ``"%s"``, ``"%A"``, ``"%U"``,
      ``"%V"``, ``"%S"``, ``"%R"`` added.


.. c:function:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :c:func:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.


.. c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, \
                               const char *encoding, const char *errors)

   Decode an encoded object *obj* to a Unicode object.

   :class:`bytes`, :class:`bytearray` and other
   :term:`bytes-like objects <bytes-like object>`
   are decoded according to the given *encoding* and using the error handling
   defined by *errors*. Both can be ``NULL`` to have the interface use the default
   values (see :ref:`builtincodecs` for details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns ``NULL`` if there was an error.  The caller is responsible for
   decref'ing the returned objects.


.. c:function:: Py_ssize_t PyUnicode_GetLength(PyObject *unicode)

   Return the length of the Unicode object, in code points.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_CopyCharacters(PyObject *to, \
                                                    Py_ssize_t to_start, \
                                                    PyObject *from, \
                                                    Py_ssize_t from_start, \
                                                    Py_ssize_t how_many)

   Copy characters from one Unicode object into another.  This function performs
   character conversion when necessary and falls back to :c:func:`memcpy` if
   possible.  Returns ``-1`` and sets an exception on error, otherwise returns
   the number of copied characters.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_Fill(PyObject *unicode, Py_ssize_t start, \
                        Py_ssize_t length, Py_UCS4 fill_char)

   Fill a string with a character: write *fill_char* into
   ``unicode[start:start+length]``.

   Fail if *fill_char* is bigger than the string maximum character, or if the
   string has more than 1 reference.

   Return the number of written characters, or return ``-1`` and raise an
   exception on error.

   .. versionadded:: 3.3


.. c:function:: int PyUnicode_WriteChar(PyObject *unicode, Py_ssize_t index, \
                                        Py_UCS4 character)

   Write a character to a string.  The string must have been created through
   :c:func:`PyUnicode_New`.  Since Unicode strings are supposed to be immutable,
   the string must not be shared, or have been hashed yet.

   This function checks that *unicode* is a Unicode object, that the index is
   not out of bounds, and that the object can be modified safely (i.e. that
   its reference count is one).

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_ReadChar(PyObject *unicode, Py_ssize_t index)

   Read a character from a string.  This function checks that *unicode* is a
   Unicode object and the index is not out of bounds, in contrast to the macro
   version :c:func:`PyUnicode_READ_CHAR`.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_Substring(PyObject *str, Py_ssize_t start, \
                                              Py_ssize_t end)

   Return a substring of *str*, from character index *start* (included) to
   character index *end* (excluded).  Negative indices are not supported.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4* PyUnicode_AsUCS4(PyObject *u, Py_UCS4 *buffer, \
                                          Py_ssize_t buflen, int copy_null)

   Copy the string *u* into a UCS4 buffer, including a null character, if
   *copy_null* is set.  Returns ``NULL`` and sets an exception on error (in
   particular, a :exc:`SystemError` if *buflen* is smaller than the length of
   *u*).  *buffer* is returned on success.

   .. versionadded:: 3.3

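The buffer-size rule of :c:func:`PyUnicode_AsUCS4` can be modelled without the
interpreter.  In the sketch below, ``copy_ucs4`` is a hypothetical plain C
function operating on an array of code points instead of a Unicode object.  It
additionally assumes that the optional null terminator needs its own slot in
*buflen*, which the description above does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: copy length code points from src into buffer and,
   if copy_null is nonzero, append a null code point.  Returns buffer on
   success, or NULL when buflen is too small. */
uint32_t *
copy_ucs4(const uint32_t *src, size_t length,
          uint32_t *buffer, size_t buflen, int copy_null)
{
    size_t needed = length + (copy_null ? 1 : 0);
    assert(src != NULL && buffer != NULL);
    if (buflen < needed) {
        return NULL;    /* the real API would also set an exception */
    }
    memcpy(buffer, src, length * sizeof(uint32_t));
    if (copy_null) {
        buffer[length] = 0;
    }
    return buffer;
}
```

Returning the destination pointer on success mirrors the documented behaviour
and lets the call be tested and used in a single expression.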

.. c:function:: Py_UCS4* PyUnicode_AsUCS4Copy(PyObject *u)

   Copy the string *u* into a new UCS4 buffer that is allocated using
   :c:func:`PyMem_Malloc`.  If this fails, ``NULL`` is returned with a
   :exc:`MemoryError` set.  The returned buffer always has an extra
   null code point appended.

   .. versionadded:: 3.3


Deprecated Py_UNICODE APIs
""""""""""""""""""""""""""

.. deprecated-removed:: 3.3 4.0

These API functions are deprecated with the implementation of :pep:`393`.
Extension modules can continue using them, as they will not be removed in Python
3.x, but need to be aware that their use can now cause performance and memory hits.


.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
   may be ``NULL`` which causes the contents to be undefined. It is the user's
   responsibility to fill in the needed data.  The buffer is copied into the new
   object.

   If the buffer is not ``NULL``, the return value might be a shared object.
   Therefore, modification of the resulting Unicode object is only allowed when
   *u* is ``NULL``.

   If the buffer is ``NULL``, :c:func:`PyUnicode_READY` must be called once the
   string content has been filled before using any of the access macros such as
   :c:func:`PyUnicode_KIND`.

   Please migrate to using :c:func:`PyUnicode_FromKindAndData`,
   :c:func:`PyUnicode_FromWideChar` or :c:func:`PyUnicode_New`.


.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal
   :c:type:`Py_UNICODE` buffer, or ``NULL`` on error. This will create the
   :c:type:`Py_UNICODE*` representation of the object if it is not yet
   available. The buffer is always terminated with an extra null code point.
   Note that the resulting :c:type:`Py_UNICODE` string may also contain
   embedded null code points, which would cause the string to be truncated when
   used in most C functions.

   Please migrate to using :c:func:`PyUnicode_AsUCS4`,
   :c:func:`PyUnicode_AsWideChar`, :c:func:`PyUnicode_ReadChar` or similar new
   APIs.

   .. deprecated-removed:: 3.3 3.10


.. c:function:: PyObject* PyUnicode_TransformDecimalToASCII(Py_UNICODE *s, Py_ssize_t size)

   Create a Unicode object by replacing all decimal digits in the
   :c:type:`Py_UNICODE` buffer *s* of the given *size* by ASCII digits 0--9
   according to their decimal value.  Return ``NULL`` if an exception occurs.


.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size)

   Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:type:`Py_UNICODE`
   array length (excluding the extra null terminator) in *size*.
   Note that the resulting :c:type:`Py_UNICODE*` string
   may contain embedded null code points, which would cause the string to be
   truncated when used in most C functions.

   .. versionadded:: 3.3


.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeCopy(PyObject *unicode)

   Create a copy of a Unicode string ending with a null code point. Return ``NULL``
   and raise a :exc:`MemoryError` exception on memory allocation failure,
   otherwise return a newly allocated buffer (use :c:func:`PyMem_Free` to free
   the buffer). Note that the resulting :c:type:`Py_UNICODE*` string may
   contain embedded null code points, which would cause the string to be
   truncated when used in most C functions.

   .. versionadded:: 3.2

   Please migrate to using :c:func:`PyUnicode_AsUCS4Copy` or similar new APIs.


.. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
   code units (this includes surrogate pairs as 2 units).

   Please migrate to using :c:func:`PyUnicode_GetLength`.


.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Copy an instance of a Unicode subtype to a new true Unicode object if
   necessary. If *obj* is already a true Unicode object (not a subtype),
   return the reference with incremented refcount.

   Objects other than Unicode or its subtypes will cause a :exc:`TypeError`.


Locale Encoding
"""""""""""""""

The current locale encoding can be used to decode text from the operating
system.

.. c:function:: PyObject* PyUnicode_DecodeLocaleAndSize(const char *str, \
                                                        Py_ssize_t len, \
                                                        const char *errors)

   Decode a string from UTF-8 on Android and VxWorks, or from the current
   locale encoding on other platforms. The supported
   error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The decoder uses ``"strict"`` error handler if
   *errors* is ``NULL``.  *str* must end with a null character but
   cannot contain embedded null characters.

   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` to decode a string from
   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
   Python startup).

   This function ignores the Python UTF-8 mode.

   .. seealso::

      The :c:func:`Py_DecodeLocale` function.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously, :c:func:`Py_DecodeLocale`
      was used for the ``surrogateescape``, and the current locale encoding was
      used for ``strict``.
790
791.. c:function:: PyObject* PyUnicode_DecodeLocale(const char *str, const char *errors)
792
793   Similar to :c:func:`PyUnicode_DecodeLocaleAndSize`, but compute the string
794   length using :c:func:`strlen`.
795
796   .. versionadded:: 3.3
797
798
799.. c:function:: PyObject* PyUnicode_EncodeLocale(PyObject *unicode, const char *errors)
800
801   Encode a Unicode object to UTF-8 on Android and VxWorks, or to the current
802   locale encoding on other platforms. The
803   supported error handlers are ``"strict"`` and ``"surrogateescape"``
804   (:pep:`383`). The encoder uses ``"strict"`` error handler if
805   *errors* is ``NULL``. Return a :class:`bytes` object. *unicode* cannot
806   contain embedded null characters.
807
808   Use :c:func:`PyUnicode_EncodeFSDefault` to encode a string to
809   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
810   Python startup).
811
812   This function ignores the Python UTF-8 mode.
813
814   .. seealso::
815
816      The :c:func:`Py_EncodeLocale` function.
817
818   .. versionadded:: 3.3
819
820   .. versionchanged:: 3.7
821      The function now also uses the current locale encoding for the
822      ``surrogateescape`` error handler, except on Android. Previously,
823      :c:func:`Py_EncodeLocale`
824      was used for the ``surrogateescape``, and the current locale encoding was
825      used for ``strict``.
826
827
828File System Encoding
829""""""""""""""""""""
830
831To encode and decode file names and other environment strings,
832:c:data:`Py_FileSystemDefaultEncoding` should be used as the encoding, and
833:c:data:`Py_FileSystemDefaultEncodeErrors` should be used as the error handler
834(:pep:`383` and :pep:`529`). To encode file names to :class:`bytes` during
835argument parsing, the ``"O&"`` converter should be used, passing
836:c:func:`PyUnicode_FSConverter` as the conversion function:
837
838.. c:function:: int PyUnicode_FSConverter(PyObject* obj, void* result)
839
840   ParseTuple converter: encode :class:`str` objects -- obtained directly or
841   through the :class:`os.PathLike` interface -- to :class:`bytes` using
842   :c:func:`PyUnicode_EncodeFSDefault`; :class:`bytes` objects are output as-is.
843   *result* must be a :c:type:`PyBytesObject*` which must be released when it is
844   no longer used.
845
846   .. versionadded:: 3.1
847
848   .. versionchanged:: 3.6
849      Accepts a :term:`path-like object`.
850
851To decode file names to :class:`str` during argument parsing, the ``"O&"``
852converter should be used, passing :c:func:`PyUnicode_FSDecoder` as the
853conversion function:
854
855.. c:function:: int PyUnicode_FSDecoder(PyObject* obj, void* result)
856
857   ParseTuple converter: decode :class:`bytes` objects -- obtained either
858   directly or indirectly through the :class:`os.PathLike` interface -- to
859   :class:`str` using :c:func:`PyUnicode_DecodeFSDefaultAndSize`; :class:`str`
860   objects are output as-is. *result* must be a :c:type:`PyUnicodeObject*` which
861   must be released when it is no longer used.
862
863   .. versionadded:: 3.2
864
865   .. versionchanged:: 3.6
866      Accepts a :term:`path-like object`.
867
868
869.. c:function:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)
870
871   Decode a string using :c:data:`Py_FileSystemDefaultEncoding` and the
872   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
873
874   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
875   locale encoding.
876
877   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
878   locale encoding and cannot be modified later. If you need to decode a string
879   from the current locale encoding, use
880   :c:func:`PyUnicode_DecodeLocaleAndSize`.
881
882   .. seealso::
883
884      The :c:func:`Py_DecodeLocale` function.
885
886   .. versionchanged:: 3.6
887      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
888
889
890.. c:function:: PyObject* PyUnicode_DecodeFSDefault(const char *s)
891
892   Decode a null-terminated string using :c:data:`Py_FileSystemDefaultEncoding`
893   and the :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
894
895   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
896   locale encoding.
897
898   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length.
899
900   .. versionchanged:: 3.6
901      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
902
903
904.. c:function:: PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode)
905
906   Encode a Unicode object to :c:data:`Py_FileSystemDefaultEncoding` with the
907   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler, and return
908   :class:`bytes`. Note that the resulting :class:`bytes` object may contain
909   null bytes.
910
911   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
912   locale encoding.
913
914   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
915   locale encoding and cannot be modified later. If you need to encode a string
916   to the current locale encoding, use :c:func:`PyUnicode_EncodeLocale`.
917
918   .. seealso::
919
920      The :c:func:`Py_EncodeLocale` function.
921
922   .. versionadded:: 3.2
923
924   .. versionchanged:: 3.6
925      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
926
927wchar_t Support
928"""""""""""""""
929
930:c:type:`wchar_t` support for platforms which support it:
931
932.. c:function:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
933
934   Create a Unicode object from the :c:type:`wchar_t` buffer *w* of the given *size*.
   Passing ``-1`` as the *size* indicates that the function must itself compute the length,
   using :c:func:`wcslen`.
937   Return ``NULL`` on failure.
938
939
940.. c:function:: Py_ssize_t PyUnicode_AsWideChar(PyObject *unicode, wchar_t *w, Py_ssize_t size)
941
942   Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*.  At most
943   *size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing
944   null termination character).  Return the number of :c:type:`wchar_t` characters
945   copied or ``-1`` in case of an error.  Note that the resulting :c:type:`wchar_t*`
946   string may or may not be null-terminated.  It is the responsibility of the caller
947   to make sure that the :c:type:`wchar_t*` string is null-terminated in case this is
948   required by the application. Also, note that the :c:type:`wchar_t*` string
949   might contain null characters, which would cause the string to be truncated
950   when used with most C functions.
951
952
953.. c:function:: wchar_t* PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)
954
955   Convert the Unicode object to a wide character string. The output string
956   always ends with a null character. If *size* is not ``NULL``, write the number
957   of wide characters (excluding the trailing null termination character) into
958   *\*size*. Note that the resulting :c:type:`wchar_t` string might contain
959   null characters, which would cause the string to be truncated when used with
960   most C functions. If *size* is ``NULL`` and the :c:type:`wchar_t*` string
961   contains null characters a :exc:`ValueError` is raised.
962
963   Returns a buffer allocated by :c:func:`PyMem_Alloc` (use
964   :c:func:`PyMem_Free` to free it) on success. On error, returns ``NULL``
   and *\*size* is undefined. Raises a :exc:`MemoryError` if memory allocation
   fails.
967
968   .. versionadded:: 3.2
969
970   .. versionchanged:: 3.7
971      Raises a :exc:`ValueError` if *size* is ``NULL`` and the :c:type:`wchar_t*`
972      string contains null characters.
973
974
975.. _builtincodecs:
976
977Built-in Codecs
978^^^^^^^^^^^^^^^
979
980Python provides a set of built-in codecs which are written in C for speed. All of
981these codecs are directly usable via the following functions.
982
Many of the following APIs take two arguments, *encoding* and *errors*, which
have the same semantics as the corresponding parameters of the built-in
:func:`str` string object constructor.
986
Setting *encoding* to ``NULL`` causes the default encoding to be used,
which is UTF-8.  File system calls should use
:c:func:`PyUnicode_FSConverter` for encoding file names. This uses the
variable :c:data:`Py_FileSystemDefaultEncoding` internally. This
variable should be treated as read-only: on some systems, it will be a
pointer to a static string; on others, it will change at run-time
(such as when the application invokes ``setlocale``).
994
Error handling is set by *errors*, which may also be set to ``NULL``, meaning to use
the default handling defined for the codec.  Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).
998
The codecs all use a similar interface.  For simplicity, only deviations from
the following generic ones are documented.
1001
1002
1003Generic Codecs
1004""""""""""""""
1005
1006These are the generic codec APIs:
1007
1008
1009.. c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, \
1010                              const char *encoding, const char *errors)
1011
1012   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
1013   *encoding* and *errors* have the same meaning as the parameters of the same name
1014   in the :func:`str` built-in function.  The codec to be used is looked up
1015   using the Python codec registry.  Return ``NULL`` if an exception was raised by
1016   the codec.
1017
1018
1019.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, \
1020                              const char *encoding, const char *errors)
1021
   Encode a Unicode object and return the result as a Python bytes object.
1023   *encoding* and *errors* have the same meaning as the parameters of the same
1024   name in the Unicode :meth:`~str.encode` method. The codec to be used is looked up
1025   using the Python codec registry. Return ``NULL`` if an exception was raised by
1026   the codec.
1027
1028
1029.. c:function:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, \
1030                              const char *encoding, const char *errors)
1031
1032   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* and return a Python
1033   bytes object.  *encoding* and *errors* have the same meaning as the
1034   parameters of the same name in the Unicode :meth:`~str.encode` method.  The codec
1035   to be used is looked up using the Python codec registry.  Return ``NULL`` if an
1036   exception was raised by the codec.
1037
1038   .. deprecated-removed:: 3.3 4.0
1039      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1040      :c:func:`PyUnicode_AsEncodedString`.
1041
1042
1043UTF-8 Codecs
1044""""""""""""
1045
1046These are the UTF-8 codec APIs:
1047
1048
1049.. c:function:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
1050
1051   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
1052   *s*. Return ``NULL`` if an exception was raised by the codec.
1053
1054
1055.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, \
1056                              const char *errors, Py_ssize_t *consumed)
1057
1058   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF8`. If
1059   *consumed* is not ``NULL``, trailing incomplete UTF-8 byte sequences will not be
1060   treated as an error. Those bytes will not be decoded and the number of bytes
1061   that have been decoded will be stored in *consumed*.
1062
1063
1064.. c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
1065
   Encode a Unicode object using UTF-8 and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
1068   raised by the codec.
1069
1070
1071.. c:function:: const char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)
1072
1073   Return a pointer to the UTF-8 encoding of the Unicode object, and
1074   store the size of the encoded representation (in bytes) in *size*.  The
1075   *size* argument can be ``NULL``; in this case no size will be stored.  The
1076   returned buffer always has an extra null byte appended (not included in
1077   *size*), regardless of whether there are any other null code points.
1078
1079   In the case of an error, ``NULL`` is returned with an exception set and no
1080   *size* is stored.
1081
1082   This caches the UTF-8 representation of the string in the Unicode object, and
1083   subsequent calls will return a pointer to the same buffer.  The caller is not
1084   responsible for deallocating the buffer.
1085
1086   .. versionadded:: 3.3
1087
1088   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.
1090
1091
1092.. c:function:: const char* PyUnicode_AsUTF8(PyObject *unicode)
1093
1094   As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.
1095
1096   .. versionadded:: 3.3
1097
1098   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.
1100
1101
1102.. c:function:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
1103
1104   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-8 and
1105   return a Python bytes object.  Return ``NULL`` if an exception was raised by
1106   the codec.
1107
1108   .. deprecated-removed:: 3.3 4.0
1109      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1110      :c:func:`PyUnicode_AsUTF8String`, :c:func:`PyUnicode_AsUTF8AndSize` or
1111      :c:func:`PyUnicode_AsEncodedString`.
1112
1113
1114UTF-32 Codecs
1115"""""""""""""
1116
1117These are the UTF-32 codec APIs:
1118
1119
1120.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, \
1121                              const char *errors, int *byteorder)
1122
1123   Decode *size* bytes from a UTF-32 encoded buffer string and return the
1124   corresponding Unicode object.  *errors* (if non-``NULL``) defines the error
1125   handling. It defaults to "strict".
1126
1127   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
1128   order::
1129
1130      *byteorder == -1: little endian
1131      *byteorder == 0:  native order
1132      *byteorder == 1:  big endian
1133
1134   If ``*byteorder`` is zero, and the first four bytes of the input data are a
1135   byte order mark (BOM), the decoder switches to this byte order and the BOM is
1136   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
1137   ``1``, any byte order mark is copied to the output.
1138
1139   After completion, *\*byteorder* is set to the current byte order at the end
1140   of input data.
1141
1142   If *byteorder* is ``NULL``, the codec starts in native order mode.
1143
1144   Return ``NULL`` if an exception was raised by the codec.
1145
1146
1147.. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, \
1148                              const char *errors, int *byteorder, Py_ssize_t *consumed)
1149
1150   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF32`. If
1151   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF32Stateful` will not treat
1152   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
1153   by four) as an error. Those bytes will not be decoded and the number of bytes
1154   that have been decoded will be stored in *consumed*.
1155
1156
1157.. c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
1158
1159   Return a Python byte string using the UTF-32 encoding in native byte
   order. The string always starts with a BOM.  Error handling is "strict".
1161   Return ``NULL`` if an exception was raised by the codec.
1162
1163
1164.. c:function:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, \
1165                              const char *errors, int byteorder)
1166
1167   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
1168   data in *s*.  Output is written according to the following byte order::
1169
1170      byteorder == -1: little endian
1171      byteorder == 0:  native byte order (writes a BOM mark)
1172      byteorder == 1:  big endian
1173
   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM (U+FEFF). In the other two modes, no BOM is prepended.
1176
1177   If ``Py_UNICODE_WIDE`` is not defined, surrogate pairs will be output
1178   as a single code point.
1179
1180   Return ``NULL`` if an exception was raised by the codec.
1181
1182   .. deprecated-removed:: 3.3 4.0
1183      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1184      :c:func:`PyUnicode_AsUTF32String` or :c:func:`PyUnicode_AsEncodedString`.
1185
1186
1187UTF-16 Codecs
1188"""""""""""""
1189
1190These are the UTF-16 codec APIs:
1191
1192
1193.. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, \
1194                              const char *errors, int *byteorder)
1195
1196   Decode *size* bytes from a UTF-16 encoded buffer string and return the
1197   corresponding Unicode object.  *errors* (if non-``NULL``) defines the error
1198   handling. It defaults to "strict".
1199
1200   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
1201   order::
1202
1203      *byteorder == -1: little endian
1204      *byteorder == 0:  native order
1205      *byteorder == 1:  big endian
1206
1207   If ``*byteorder`` is zero, and the first two bytes of the input data are a
1208   byte order mark (BOM), the decoder switches to this byte order and the BOM is
1209   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
1210   ``1``, any byte order mark is copied to the output (where it will result in
1211   either a ``\ufeff`` or a ``\ufffe`` character).
1212
1213   After completion, *\*byteorder* is set to the current byte order at the end
1214   of input data.
1215
1216   If *byteorder* is ``NULL``, the codec starts in native order mode.
1217
1218   Return ``NULL`` if an exception was raised by the codec.
1219
1220
1221.. c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, \
1222                              const char *errors, int *byteorder, Py_ssize_t *consumed)
1223
1224   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF16`. If
1225   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF16Stateful` will not treat
1226   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
1227   split surrogate pair) as an error. Those bytes will not be decoded and the
1228   number of bytes that have been decoded will be stored in *consumed*.
1229
1230
1231.. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
1232
1233   Return a Python byte string using the UTF-16 encoding in native byte
   order. The string always starts with a BOM.  Error handling is "strict".
1235   Return ``NULL`` if an exception was raised by the codec.
1236
1237
1238.. c:function:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, \
1239                              const char *errors, int byteorder)
1240
1241   Return a Python bytes object holding the UTF-16 encoded value of the Unicode
1242   data in *s*.  Output is written according to the following byte order::
1243
1244      byteorder == -1: little endian
1245      byteorder == 0:  native byte order (writes a BOM mark)
1246      byteorder == 1:  big endian
1247
   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM (U+FEFF). In the other two modes, no BOM is prepended.
1250
   If ``Py_UNICODE_WIDE`` is defined, a single :c:type:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :c:type:`Py_UNICODE`
   value is interpreted as a UCS-2 character.
1254
1255   Return ``NULL`` if an exception was raised by the codec.
1256
1257   .. deprecated-removed:: 3.3 4.0
1258      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1259      :c:func:`PyUnicode_AsUTF16String` or :c:func:`PyUnicode_AsEncodedString`.
1260
1261
1262UTF-7 Codecs
1263""""""""""""
1264
1265These are the UTF-7 codec APIs:
1266
1267
1268.. c:function:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)
1269
1270   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
1271   *s*.  Return ``NULL`` if an exception was raised by the codec.
1272
1273
1274.. c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, \
1275                              const char *errors, Py_ssize_t *consumed)
1276
1277   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF7`.  If
1278   *consumed* is not ``NULL``, trailing incomplete UTF-7 base-64 sections will not
1279   be treated as an error.  Those bytes will not be decoded and the number of
1280   bytes that have been decoded will be stored in *consumed*.
1281
1282
1283.. c:function:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, \
1284                              int base64SetO, int base64WhiteSpace, const char *errors)
1285
1286   Encode the :c:type:`Py_UNICODE` buffer of the given size using UTF-7 and
1287   return a Python bytes object.  Return ``NULL`` if an exception was raised by
1288   the codec.
1289
1290   If *base64SetO* is nonzero, "Set O" (punctuation that has no otherwise
1291   special meaning) will be encoded in base-64.  If *base64WhiteSpace* is
1292   nonzero, whitespace will be encoded in base-64.  Both are set to zero for the
1293   Python "utf-7" codec.
1294
1295   .. deprecated-removed:: 3.3 4.0
1296      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1297      :c:func:`PyUnicode_AsEncodedString`.
1298
1299
1300Unicode-Escape Codecs
1301"""""""""""""""""""""
1302
1303These are the "Unicode Escape" codec APIs:
1304
1305
1306.. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, \
1307                              Py_ssize_t size, const char *errors)
1308
1309   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
1310   string *s*.  Return ``NULL`` if an exception was raised by the codec.
1311
1312
1313.. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
1314
1315   Encode a Unicode object using Unicode-Escape and return the result as a
1316   bytes object.  Error handling is "strict".  Return ``NULL`` if an exception was
1317   raised by the codec.
1318
1319
1320.. c:function:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
1321
1322   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Unicode-Escape and
1323   return a bytes object.  Return ``NULL`` if an exception was raised by the codec.
1324
1325   .. deprecated-removed:: 3.3 4.0
1326      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1327      :c:func:`PyUnicode_AsUnicodeEscapeString`.
1328
1329
1330Raw-Unicode-Escape Codecs
1331"""""""""""""""""""""""""
1332
1333These are the "Raw Unicode Escape" codec APIs:
1334
1335
1336.. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, \
1337                              Py_ssize_t size, const char *errors)
1338
1339   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
1340   encoded string *s*.  Return ``NULL`` if an exception was raised by the codec.
1341
1342
1343.. c:function:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
1344
1345   Encode a Unicode object using Raw-Unicode-Escape and return the result as
1346   a bytes object.  Error handling is "strict".  Return ``NULL`` if an exception
1347   was raised by the codec.
1348
1349
1350.. c:function:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, \
1351                              Py_ssize_t size)
1352
1353   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Raw-Unicode-Escape
1354   and return a bytes object.  Return ``NULL`` if an exception was raised by the codec.
1355
1356   .. deprecated-removed:: 3.3 4.0
1357      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1358      :c:func:`PyUnicode_AsRawUnicodeEscapeString` or
1359      :c:func:`PyUnicode_AsEncodedString`.
1360
1361
1362Latin-1 Codecs
1363""""""""""""""
1364
1365These are the Latin-1 codec APIs: Latin-1 corresponds to the first 256 Unicode
1366ordinals and only these are accepted by the codecs during encoding.
1367
1368
1369.. c:function:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
1370
1371   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
1372   *s*.  Return ``NULL`` if an exception was raised by the codec.
1373
1374
1375.. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
1376
   Encode a Unicode object using Latin-1 and return the result as a Python bytes
1378   object.  Error handling is "strict".  Return ``NULL`` if an exception was
1379   raised by the codec.
1380
1381
1382.. c:function:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
1383
1384   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Latin-1 and
1385   return a Python bytes object.  Return ``NULL`` if an exception was raised by
1386   the codec.
1387
1388   .. deprecated-removed:: 3.3 4.0
1389      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1390      :c:func:`PyUnicode_AsLatin1String` or
1391      :c:func:`PyUnicode_AsEncodedString`.
1392
1393
1394ASCII Codecs
1395""""""""""""
1396
1397These are the ASCII codec APIs.  Only 7-bit ASCII data is accepted. All other
1398codes generate errors.
1399
1400
1401.. c:function:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
1402
1403   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
1404   *s*.  Return ``NULL`` if an exception was raised by the codec.
1405
1406
1407.. c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
1408
   Encode a Unicode object using ASCII and return the result as a Python bytes
1410   object.  Error handling is "strict".  Return ``NULL`` if an exception was
1411   raised by the codec.
1412
1413
1414.. c:function:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
1415
1416   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using ASCII and
1417   return a Python bytes object.  Return ``NULL`` if an exception was raised by
1418   the codec.
1419
1420   .. deprecated-removed:: 3.3 4.0
1421      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1422      :c:func:`PyUnicode_AsASCIIString` or
1423      :c:func:`PyUnicode_AsEncodedString`.
1424
1425
1426Character Map Codecs
1427""""""""""""""""""""
1428
1429This codec is special in that it can be used to implement many different codecs
1430(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses a mapping to encode and
1432decode characters.  The mapping objects provided must support the
1433:meth:`__getitem__` mapping interface; dictionaries and sequences work well.
1434
1435These are the mapping codec APIs:
1436
1437.. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *data, Py_ssize_t size, \
1438                              PyObject *mapping, const char *errors)
1439
   Create a Unicode object by decoding *size* bytes of the encoded string *data*
1441   using the given *mapping* object.  Return ``NULL`` if an exception was raised
1442   by the codec.
1443
1444   If *mapping* is ``NULL``, Latin-1 decoding will be applied.  Else
1445   *mapping* must map bytes ordinals (integers in the range from 0 to 255)
1446   to Unicode strings, integers (which are then interpreted as Unicode
1447   ordinals) or ``None``.  Unmapped data bytes -- ones which cause a
1448   :exc:`LookupError`, as well as ones which get mapped to ``None``,
1449   ``0xFFFE`` or ``'\ufffe'``, are treated as undefined mappings and cause
1450   an error.
1451
1452
1453.. c:function:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
1454
1455   Encode a Unicode object using the given *mapping* object and return the
1456   result as a bytes object.  Error handling is "strict".  Return ``NULL`` if an
1457   exception was raised by the codec.
1458
1459   The *mapping* object must map Unicode ordinal integers to bytes objects,
1460   integers in the range from 0 to 255 or ``None``.  Unmapped character
   ordinals (ones which cause a :exc:`LookupError`) as well as ones mapped to
1462   ``None`` are treated as "undefined mapping" and cause an error.
1463
1464
1465.. c:function:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, \
1466                              PyObject *mapping, const char *errors)
1467
1468   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using the given
1469   *mapping* object and return the result as a bytes object.  Return ``NULL`` if
1470   an exception was raised by the codec.
1471
1472   .. deprecated-removed:: 3.3 4.0
1473      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1474      :c:func:`PyUnicode_AsCharmapString` or
1475      :c:func:`PyUnicode_AsEncodedString`.
1476
1477
The following codec API is special in that it maps Unicode to Unicode.
1479
1480.. c:function:: PyObject* PyUnicode_Translate(PyObject *unicode, \
1481                              PyObject *mapping, const char *errors)
1482
1483   Translate a Unicode object using the given *mapping* object and return the
1484   resulting Unicode object.  Return ``NULL`` if an exception was raised by the
1485   codec.
1486
1487   The *mapping* object must map Unicode ordinal integers to Unicode strings,
1488   integers (which are then interpreted as Unicode ordinals) or ``None``
1489   (causing deletion of the character).  Unmapped character ordinals (ones
1490   which cause a :exc:`LookupError`) are left untouched and are copied as-is.
1491

.. c:function:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Translate a :c:type:`Py_UNICODE` buffer of the given *size* by applying a
   character *mapping* table to it and return the resulting Unicode object.
   Return ``NULL`` when an exception was raised by the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_Translate` or the :ref:`generic codec based API
      <codec-registry>`.


MBCS codecs for Windows
"""""""""""""""""""""""

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions.  Note that MBCS (or
DBCS) is a class of encodings, not just one.  The target encoding is defined by
the user settings on the machine running the codec.

.. c:function:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeMBCS`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeMBCSStateful` will not
   decode a trailing lead byte, and the number of bytes that have been decoded
   will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeCodePage(int code_page, PyObject *unicode, const char *errors)

   Encode the Unicode object using the specified code page and return a Python
   bytes object.  Return ``NULL`` if an exception was raised by the codec. Use
   the :c:data:`CP_ACP` code page to get the MBCS encoder.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using MBCS and return
   a Python bytes object.  Return ``NULL`` if an exception was raised by the
   codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsMBCSString`, :c:func:`PyUnicode_EncodeCodePage` or
      :c:func:`PyUnicode_AsEncodedString`.


Methods & Slots
"""""""""""""""


.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return ``NULL`` or ``-1`` if an exception occurs.


.. c:function:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. c:function:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string giving a list of Unicode strings.  If *sep* is ``NULL``, splitting
   will be done at all whitespace substrings.  Otherwise, splits occur at the given
   separator.  At most *maxsplit* splits will be done.  If negative, no limit is
   set.  Separators are not included in the resulting list.


.. c:function:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break.  If *keepend* is ``0``, the line break
   characters are not included in the resulting strings.


.. c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, \
                              const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well.  Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be ``NULL``, which indicates
   that the default error handling should be used.


.. c:function:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given *separator* and return the resulting
   Unicode string.


.. c:function:: Py_ssize_t PyUnicode_Tailmatch(PyObject *str, PyObject *substr, \
                        Py_ssize_t start, Py_ssize_t end, int direction)

   Return ``1`` if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == ``-1`` means to do a prefix match, *direction* == ``1`` a suffix match),
   ``0`` otherwise. Return ``-1`` if an error occurred.


.. c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, \
                               Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == ``1`` means to do a forward search, *direction* == ``-1`` a
   backward search).  The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.


.. c:function:: Py_ssize_t PyUnicode_FindChar(PyObject *str, Py_UCS4 ch, \
                               Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of the character *ch* in ``str[start:end]`` using
   the given *direction* (*direction* == ``1`` means to do a forward search,
   *direction* == ``-1`` a backward search).  The return value is the index of the
   first match; a value of ``-1`` indicates that no match was found, and ``-2``
   indicates that an error occurred and an exception has been set.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      *start* and *end* are now adjusted to behave like ``str[start:end]``.


.. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, \
                               Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``.  Return ``-1`` if an error occurred.


.. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, \
                              PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == ``-1`` means replace all
   occurrences.


.. c:function:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return ``-1``, ``0``, ``1`` for less than, equal, and greater than,
   respectively.

   This function returns ``-1`` upon failure, so one should call
   :c:func:`PyErr_Occurred` to check for errors.


.. c:function:: int PyUnicode_CompareWithASCIIString(PyObject *uni, const char *string)

   Compare a Unicode object, *uni*, with *string* and return ``-1``, ``0``, ``1`` for less
   than, equal, and greater than, respectively. It is best to pass only
   ASCII-encoded strings, but the function interprets the input string as
   ISO-8859-1 if it contains non-ASCII characters.

   This function does not raise exceptions.


.. c:function:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. c:function:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``.


.. c:function:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one-element Unicode string. ``-1`` is returned
   if there was an error.


.. c:function:: void PyUnicode_InternInPlace(PyObject **string)

   Intern the argument *\*string* in place.  The argument must be the address of a
   pointer variable pointing to a Python Unicode string object.  If there is an
   existing interned string that is the same as *\*string*, it sets *\*string* to
   it (decrementing the reference count of the old string object and incrementing
   the reference count of the interned string object), otherwise it leaves
   *\*string* alone and interns it (incrementing its reference count).
   (Clarification: even though there is a lot of talk about reference counts, think
   of this function as reference-count-neutral; you own the object after the call
   if and only if you owned it before the call.)


.. c:function:: PyObject* PyUnicode_InternFromString(const char *v)

   A combination of :c:func:`PyUnicode_FromString` and
   :c:func:`PyUnicode_InternInPlace`, returning either a new Unicode string
   object that has been interned, or a new ("owned") reference to an earlier
   interned string object with the same value.
