.. highlight:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Georg Brandl <georg@python.org>

Unicode Objects
^^^^^^^^^^^^^^^

Since the implementation of :pep:`393` in Python 3.3, Unicode objects internally
use a variety of representations, in order to allow handling the complete range
of Unicode characters while staying memory efficient.  There are special cases
for strings where all code points are below 128, 256, or 65536; otherwise, code
points must be below 1114112 (which is the full Unicode range).

:c:expr:`Py_UNICODE*` and UTF-8 representations are created on demand and cached
in the Unicode object.  The :c:expr:`Py_UNICODE*` representation is deprecated
and inefficient.

Due to the transition between the old APIs and the new APIs, Unicode objects
can internally be in two states depending on how they were created:

* "canonical" Unicode objects are all objects created by a non-deprecated
  Unicode API.  They use the most efficient representation allowed by the
  implementation.

* "legacy" Unicode objects have been created through one of the deprecated
  APIs (typically :c:func:`PyUnicode_FromUnicode`) and only bear the
  :c:expr:`Py_UNICODE*` representation; you will have to call
  :c:func:`PyUnicode_READY` on them before calling any other API.

.. note::
   The "legacy" Unicode object will be removed in Python 3.12, together with
   the deprecated APIs.  From then on, all Unicode objects will be
   "canonical".  See :pep:`623` for more information.


Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:

.. c:type:: Py_UCS4
            Py_UCS2
            Py_UCS1

   These types are typedefs for unsigned integer types wide enough to contain
   characters of 32 bits, 16 bits and 8 bits, respectively.  When dealing with
   single Unicode characters, use :c:type:`Py_UCS4`.

   .. versionadded:: 3.3


.. c:type:: Py_UNICODE

   This is a typedef of :c:expr:`wchar_t`, which is a 16-bit type or 32-bit type
   depending on the platform.

   .. versionchanged:: 3.3
      In previous versions, this was a 16-bit type or a 32-bit type depending on
      whether you selected a "narrow" or "wide" Unicode version of Python at
      build time.


.. c:type:: PyASCIIObject
            PyCompactUnicodeObject
            PyUnicodeObject

   These subtypes of :c:type:`PyObject` represent a Python Unicode object.  In
   almost all cases, they shouldn't be used directly, since all API functions
   that deal with Unicode objects take and return :c:type:`PyObject` pointers.

   .. versionadded:: 3.3


.. c:var:: PyTypeObject PyUnicode_Type

   This instance of :c:type:`PyTypeObject` represents the Python Unicode type.  It
   is exposed to Python code as ``str``.


The following APIs are C macros and static inlined functions for fast checks and
access to internal read-only data of Unicode objects:

.. c:function:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.  This function always succeeds.


.. c:function:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.  This function always succeeds.


.. c:function:: int PyUnicode_READY(PyObject *o)

   Ensure the string object *o* is in the "canonical" representation.  This is
   required before using any of the access macros described below.

   .. XXX expand on when it is not required

   Returns ``0`` on success and ``-1`` with an exception set on failure, which in
   particular happens if memory allocation fails.

   .. versionadded:: 3.3

   .. deprecated-removed:: 3.10 3.12
      This API will be removed with :c:func:`PyUnicode_FromUnicode`.


.. c:function:: Py_ssize_t PyUnicode_GET_LENGTH(PyObject *o)

   Return the length of the Unicode string, in code points.  *o* has to be a
   Unicode object in the "canonical" representation (not checked).

   .. versionadded:: 3.3


.. c:function:: Py_UCS1* PyUnicode_1BYTE_DATA(PyObject *o)
                Py_UCS2* PyUnicode_2BYTE_DATA(PyObject *o)
                Py_UCS4* PyUnicode_4BYTE_DATA(PyObject *o)

   Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4
   integer types for direct character access.  No checks are performed to
   verify that the canonical representation has the correct character size; use
   :c:func:`PyUnicode_KIND` to select the right macro.  Make sure
   :c:func:`PyUnicode_READY` has been called before accessing this.

   .. versionadded:: 3.3


.. c:macro:: PyUnicode_WCHAR_KIND
             PyUnicode_1BYTE_KIND
             PyUnicode_2BYTE_KIND
             PyUnicode_4BYTE_KIND

   Return values of the :c:func:`PyUnicode_KIND` macro.

   .. versionadded:: 3.3

   .. deprecated-removed:: 3.10 3.12
      ``PyUnicode_WCHAR_KIND`` is deprecated.


.. c:function:: int PyUnicode_KIND(PyObject *o)

   Return one of the PyUnicode kind constants (see above) that indicate how many
   bytes per character this Unicode object uses to store its data.  *o* has to
   be a Unicode object in the "canonical" representation (not checked).

   .. XXX document "0" return value?

   .. versionadded:: 3.3


.. c:function:: void* PyUnicode_DATA(PyObject *o)

   Return a void pointer to the raw Unicode buffer.  *o* has to be a Unicode
   object in the "canonical" representation (not checked).

   .. versionadded:: 3.3


.. c:function:: void PyUnicode_WRITE(int kind, void *data, \
                                     Py_ssize_t index, Py_UCS4 value)

   Write into a canonical representation *data* (as obtained with
   :c:func:`PyUnicode_DATA`).  This function performs no sanity checks, and is
   intended for usage in loops.  The caller should cache the *kind* value and
   *data* pointer as obtained from other calls.  *index* is the index in
   the string (starts at 0) and *value* is the new code point value which should
   be written to that location.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_READ(int kind, void *data, \
                                       Py_ssize_t index)

   Read a code point from a canonical representation *data* (as obtained with
   :c:func:`PyUnicode_DATA`).  No checks or ready calls are performed.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_READ_CHAR(PyObject *o, Py_ssize_t index)

   Read a character from a Unicode object *o*, which must be in the "canonical"
   representation.  This is less efficient than :c:func:`PyUnicode_READ` if you
   do multiple consecutive reads.

   .. versionadded:: 3.3
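
The idea of caching a *kind* and a *data* pointer once and then indexing in a
loop can be modelled in standalone C.  The sketch below does not call the
CPython API: ``KIND_1BYTE`` and friends, ``read_code_point``,
``write_code_point`` and ``count_code_point`` are hypothetical names chosen for
illustration, and the kind values are assumed to equal the number of bytes per
code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kind constants (assumption: the value is
   the number of bytes used per code point). */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Read the code point at `index`, dispatching on the storage width.
   Like the macros described above, this performs no bounds checks. */
static uint32_t
read_code_point(int kind, const void *data, size_t index)
{
    switch (kind) {
    case KIND_1BYTE:
        return ((const uint8_t *)data)[index];
    case KIND_2BYTE:
        return ((const uint16_t *)data)[index];
    default:
        return ((const uint32_t *)data)[index];
    }
}

/* Write `value` at `index`; the caller must ensure it fits the width. */
static void
write_code_point(int kind, void *data, size_t index, uint32_t value)
{
    switch (kind) {
    case KIND_1BYTE:
        ((uint8_t *)data)[index] = (uint8_t)value;
        break;
    case KIND_2BYTE:
        ((uint16_t *)data)[index] = (uint16_t)value;
        break;
    default:
        ((uint32_t *)data)[index] = value;
        break;
    }
}

/* Count occurrences of `ch`: kind and data are fetched once by the caller
   and reused for every read inside the loop. */
static size_t
count_code_point(int kind, const void *data, size_t length, uint32_t ch)
{
    size_t count = 0;
    for (size_t i = 0; i < length; i++) {
        if (read_code_point(kind, data, i) == ch) {
            count++;
        }
    }
    return count;
}
```

In real extension code the same loop shape would presumably use
:c:func:`PyUnicode_KIND`, :c:func:`PyUnicode_DATA` and
:c:func:`PyUnicode_READ` in place of the hypothetical helpers.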


.. c:function:: Py_UCS4 PyUnicode_MAX_CHAR_VALUE(PyObject *o)

   Return the maximum code point that is suitable for creating another string
   based on *o*, which must be in the "canonical" representation.  This is
   always an approximation but more efficient than iterating over the string.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
   code units (this includes surrogate pairs as 2 units).  *o* has to be a
   Unicode object (not checked).

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.
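
The difference between a length in code points and a size in code units can be
sketched in standalone C.  The helper ``utf16_code_units`` is hypothetical (it
is not part of the CPython API) and assumes a 16-bit code unit, as on platforms
where :c:expr:`wchar_t` is 16 bits wide.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of 16-bit code units needed to store
   `length` code points.  Assumes a 16-bit code unit. */
static size_t
utf16_code_units(const uint32_t *code_points, size_t length)
{
    size_t units = 0;
    for (size_t i = 0; i < length; i++) {
        /* Code points above 0xFFFF need a surrogate pair (2 units). */
        units += (code_points[i] > 0xFFFF) ? 2 : 1;
    }
    return units;
}
```

Under that assumption, a string of three code points that contains one code
point above 0xFFFF occupies four code units.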


.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation in
   bytes.  *o* has to be a Unicode object (not checked).

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
                const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to a :c:type:`Py_UNICODE` representation of the object.  The
   returned buffer is always terminated with an extra null code point.  It
   may also contain embedded null code points, which would cause the string
   to be truncated when used in most C functions.  The ``AS_DATA`` form
   casts the pointer to :c:expr:`const char *`.  The *o* argument has to be
   a Unicode object (not checked).

   .. versionchanged:: 3.3
      This function is now inefficient -- because in many cases the
      :c:type:`Py_UNICODE` representation does not exist and needs to be created
      -- and can fail (return ``NULL`` with an exception set).  Try to port the
      code to use the new :c:func:`PyUnicode_nBYTE_DATA` macros or use
      :c:func:`PyUnicode_WRITE` or :c:func:`PyUnicode_READ`.

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using the
      :c:func:`PyUnicode_nBYTE_DATA` family of macros.


.. c:function:: int PyUnicode_IsIdentifier(PyObject *o)

   Return ``1`` if the string is a valid identifier according to the language
   definition, section :ref:`identifiers`. Return ``0`` otherwise.

   .. versionchanged:: 3.9
      The function does not call :c:func:`Py_FatalError` anymore if the string
      is not ready.


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. c:function:: int Py_UNICODE_ISSPACE(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a whitespace character.


.. c:function:: int Py_UNICODE_ISLOWER(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a lowercase character.


.. c:function:: int Py_UNICODE_ISUPPER(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is an uppercase character.


.. c:function:: int Py_UNICODE_ISTITLE(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a titlecase character.


.. c:function:: int Py_UNICODE_ISLINEBREAK(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a linebreak character.


.. c:function:: int Py_UNICODE_ISDECIMAL(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a decimal character.


.. c:function:: int Py_UNICODE_ISDIGIT(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a digit character.


.. c:function:: int Py_UNICODE_ISNUMERIC(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a numeric character.


.. c:function:: int Py_UNICODE_ISALPHA(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphabetic character.


.. c:function:: int Py_UNICODE_ISALNUM(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphanumeric character.


.. c:function:: int Py_UNICODE_ISPRINTABLE(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a printable character.
   Nonprintable characters are those characters defined in the Unicode character
   database as "Other" or "Separator", excepting the ASCII space (0x20) which is
   considered printable.  (Note that printable characters in this context are
   those which should not be escaped when :func:`repr` is invoked on a string.
   It has no bearing on the handling of strings written to :data:`sys.stdout` or
   :data:`sys.stderr`.)


These APIs can be used for fast direct character conversions:


.. c:function:: Py_UCS4 Py_UNICODE_TOLOWER(Py_UCS4 ch)

   Return the character *ch* converted to lower case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: Py_UCS4 Py_UNICODE_TOUPPER(Py_UCS4 ch)

   Return the character *ch* converted to upper case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: Py_UCS4 Py_UNICODE_TOTITLE(Py_UCS4 ch)

   Return the character *ch* converted to title case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: int Py_UNICODE_TODECIMAL(Py_UCS4 ch)

   Return the character *ch* converted to a decimal positive integer.  Return
   ``-1`` if this is not possible.  This macro does not raise exceptions.


.. c:function:: int Py_UNICODE_TODIGIT(Py_UCS4 ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible.  This macro does not raise exceptions.


.. c:function:: double Py_UNICODE_TONUMERIC(Py_UCS4 ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible.  This macro does not raise exceptions.


These APIs can be used to work with surrogates:

.. c:macro:: Py_UNICODE_IS_SURROGATE(ch)

   Check if *ch* is a surrogate (``0xD800 <= ch <= 0xDFFF``).

.. c:macro:: Py_UNICODE_IS_HIGH_SURROGATE(ch)

   Check if *ch* is a high surrogate (``0xD800 <= ch <= 0xDBFF``).

.. c:macro:: Py_UNICODE_IS_LOW_SURROGATE(ch)

   Check if *ch* is a low surrogate (``0xDC00 <= ch <= 0xDFFF``).

.. c:macro:: Py_UNICODE_JOIN_SURROGATES(high, low)

   Join two surrogate characters and return a single :c:type:`Py_UCS4` value.
   *high* and *low* are respectively the leading and trailing surrogates in a
   surrogate pair.
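
The surrogate ranges and the joining step can be sketched in standalone C.  The
functions below are hypothetical equivalents written for illustration, not the
macros themselves; the joining arithmetic is the standard UTF-16 decoding rule
and is assumed to match what ``Py_UNICODE_JOIN_SURROGATES`` computes.

```c
#include <stdint.h>

/* Range checks, using the bounds given in the text above. */
static int
is_surrogate(uint32_t ch)
{
    return 0xD800 <= ch && ch <= 0xDFFF;
}

static int
is_high_surrogate(uint32_t ch)
{
    return 0xD800 <= ch && ch <= 0xDBFF;
}

static int
is_low_surrogate(uint32_t ch)
{
    return 0xDC00 <= ch && ch <= 0xDFFF;
}

/* Combine a leading (high) and trailing (low) surrogate into one code
   point.  Each surrogate carries 10 payload bits; the result is offset
   by 0x10000 because surrogate pairs only encode code points above the
   16-bit range.  The caller must pass a valid pair. */
static uint32_t
join_surrogates(uint32_t high, uint32_t low)
{
    return 0x10000 + (((high & 0x03FF) << 10) | (low & 0x03FF));
}
```

For instance, joining the pair ``0xD83D``, ``0xDE00`` with this helper yields
``0x1F600``.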


Creating and accessing Unicode strings
""""""""""""""""""""""""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:

.. c:function:: PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)

   Create a new Unicode object.  *maxchar* should be the true maximum code point
   to be placed in the string.  As an approximation, it can be rounded up to the
   nearest value in the sequence 127, 255, 65535, 1114111.

   This is the recommended way to allocate a new Unicode object.  Objects
   created using this function are not resizable.

   .. versionadded:: 3.3
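
The rounding of *maxchar* described for :c:func:`PyUnicode_New` can be written
out as a small standalone function.  ``round_up_maxchar`` is a hypothetical
helper for illustration only; it is not part of the CPython API.

```c
#include <stdint.h>

/* Hypothetical helper: round a maximum code point up to the nearest
   value in the sequence 127, 255, 65535, 1114111. */
static uint32_t
round_up_maxchar(uint32_t maxchar)
{
    if (maxchar <= 127) {
        return 127;
    }
    if (maxchar <= 255) {
        return 255;
    }
    if (maxchar <= 65535) {
        return 65535;
    }
    return 1114111;   /* top of the Unicode range */
}
```

A caller that only knows an upper bound for its code points could pass the
rounded value as *maxchar*.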


.. c:function:: PyObject* PyUnicode_FromKindAndData(int kind, const void *buffer, \
                                                    Py_ssize_t size)

   Create a new Unicode object with the given *kind* (possible values are
   :c:macro:`PyUnicode_1BYTE_KIND` etc., as returned by
   :c:func:`PyUnicode_KIND`).  The *buffer* must point to an array of *size*
   units of 1, 2 or 4 bytes per character, as given by the kind.

   If necessary, the input *buffer* is copied and transformed into the
   canonical representation.  For example, if the *buffer* is a UCS4 string
   (:c:macro:`PyUnicode_4BYTE_KIND`) and it consists only of code points in
   the UCS1 range, it will be transformed into UCS1
   (:c:macro:`PyUnicode_1BYTE_KIND`).

   .. versionadded:: 3.3
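
The narrowing step can be illustrated with a standalone scan over a UCS4
buffer.  This is only a sketch of the idea, not the interpreter's
implementation: ``narrowest_width`` is a hypothetical helper that returns the
smallest number of bytes per code point able to hold every element.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: find the largest code point in a UCS4 buffer and
   return the narrowest storage width (1, 2 or 4 bytes) that can hold it. */
static int
narrowest_width(const uint32_t *buffer, size_t size)
{
    uint32_t maxchar = 0;
    for (size_t i = 0; i < size; i++) {
        if (buffer[i] > maxchar) {
            maxchar = buffer[i];
        }
    }
    if (maxchar < 256) {
        return 1;   /* fits the UCS1 range */
    }
    if (maxchar < 65536) {
        return 2;   /* fits the UCS2 range */
    }
    return 4;
}
```

A UCS4 buffer holding only code points below 256 therefore maps to a width of
1, which corresponds to the UCS1 example in the description above.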


.. c:function:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*.  The bytes will be
   interpreted as being UTF-8 encoded.  The buffer is copied into the new
   object. If the buffer is not ``NULL``, the return value might be a shared
   object, i.e. modification of the data is not allowed.

   If *u* is ``NULL``, this function behaves like :c:func:`PyUnicode_FromUnicode`
   with the buffer set to ``NULL``.  This usage is deprecated in favor of
   :c:func:`PyUnicode_New`, and will be removed in Python 3.12.


.. c:function:: PyObject *PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.
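
Because the input bytes are interpreted as UTF-8, the *size* passed in bytes
generally differs from the length of the resulting string in code points.  The
standalone sketch below counts code points in a buffer; ``utf8_code_point_count``
is a hypothetical helper, and it assumes the input is already valid UTF-8 (no
validation is attempted).

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a valid UTF-8 buffer by
   counting every byte that is not a continuation byte. */
static size_t
utf8_code_point_count(const char *u, size_t size)
{
    size_t count = 0;
    for (size_t i = 0; i < size; i++) {
        /* Continuation bytes have the bit pattern 10xxxxxx. */
        if (((unsigned char)u[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

A three-byte buffer can thus hold anywhere between one and three code points,
depending on its content.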


.. c:function:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :c:func:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python Unicode string and return
   a string with the values formatted into it.  The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   ASCII-encoded string. The following format characters are allowed:

   .. % This should be exactly the same as the table in PyErr_Format.
   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.
   .. % Similar comments apply to the %ll width modifier and

   .. tabularcolumns:: |l|l|L|

   +-------------------+---------------------+----------------------------------+
   | Format Characters | Type                | Comment                          |
   +===================+=====================+==================================+
   | :attr:`%%`        | *n/a*               | The literal % character.         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%c`        | int                 | A single character,              |
   |                   |                     | represented as a C int.          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%d`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%d")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%u`        | unsigned int        | Equivalent to                    |
   |                   |                     | ``printf("%u")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%ld`       | long                | Equivalent to                    |
   |                   |                     | ``printf("%ld")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%li`       | long                | Equivalent to                    |
   |                   |                     | ``printf("%li")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lu`       | unsigned long       | Equivalent to                    |
   |                   |                     | ``printf("%lu")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lld`      | long long           | Equivalent to                    |
   |                   |                     | ``printf("%lld")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lli`      | long long           | Equivalent to                    |
   |                   |                     | ``printf("%lli")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%llu`      | unsigned long long  | Equivalent to                    |
   |                   |                     | ``printf("%llu")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zd`       | :c:type:`\          | Equivalent to                    |
   |                   | Py_ssize_t`         | ``printf("%zd")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zi`       | :c:type:`\          | Equivalent to                    |
   |                   | Py_ssize_t`         | ``printf("%zi")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zu`       | size_t              | Equivalent to                    |
   |                   |                     | ``printf("%zu")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%i`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%i")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%x`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%x")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%s`        | const char\*        | A null-terminated C character    |
   |                   |                     | array.                           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%p`        | const void\*        | The hex representation of a C    |
   |                   |                     | pointer. Mostly equivalent to    |
   |                   |                     | ``printf("%p")`` except that     |
   |                   |                     | it is guaranteed to start with   |
   |                   |                     | the literal ``0x`` regardless    |
   |                   |                     | of what the platform's           |
5247db96d56Sopenharmony_ci   |                   |                     | ``printf`` yields.               |
5257db96d56Sopenharmony_ci   +-------------------+---------------------+----------------------------------+
5267db96d56Sopenharmony_ci   | :attr:`%A`        | PyObject\*          | The result of calling            |
5277db96d56Sopenharmony_ci   |                   |                     | :func:`ascii`.                   |
5287db96d56Sopenharmony_ci   +-------------------+---------------------+----------------------------------+
5297db96d56Sopenharmony_ci   | :attr:`%U`        | PyObject\*          | A Unicode object.                |
5307db96d56Sopenharmony_ci   +-------------------+---------------------+----------------------------------+
5317db96d56Sopenharmony_ci   | :attr:`%V`        | PyObject\*,         | A Unicode object (which may be   |
5327db96d56Sopenharmony_ci   |                   | const char\*        | ``NULL``) and a null-terminated  |
5337db96d56Sopenharmony_ci   |                   |                     | C character array as a second    |
5347db96d56Sopenharmony_ci   |                   |                     | parameter (which will be used,   |
5357db96d56Sopenharmony_ci   |                   |                     | if the first parameter is        |
5367db96d56Sopenharmony_ci   |                   |                     | ``NULL``).                       |
5377db96d56Sopenharmony_ci   +-------------------+---------------------+----------------------------------+
5387db96d56Sopenharmony_ci   | :attr:`%S`        | PyObject\*          | The result of calling            |
5397db96d56Sopenharmony_ci   |                   |                     | :c:func:`PyObject_Str`.          |
5407db96d56Sopenharmony_ci   +-------------------+---------------------+----------------------------------+
5417db96d56Sopenharmony_ci   | :attr:`%R`        | PyObject\*          | The result of calling            |
5427db96d56Sopenharmony_ci   |                   |                     | :c:func:`PyObject_Repr`.         |
5437db96d56Sopenharmony_ci   +-------------------+---------------------+----------------------------------+
5447db96d56Sopenharmony_ci
   An unrecognized format character causes the rest of the format string to be
   copied as-is to the result string, and any extra arguments are discarded.
5477db96d56Sopenharmony_ci
5487db96d56Sopenharmony_ci   .. note::
      The width is measured in characters rather than bytes.
      The precision is measured in bytes for ``"%s"`` and
      ``"%V"`` (if the ``PyObject*`` argument is ``NULL``), and in
      characters for ``"%A"``, ``"%U"``, ``"%S"``, ``"%R"`` and ``"%V"``
      (if the ``PyObject*`` argument is not ``NULL``).
5547db96d56Sopenharmony_ci
5557db96d56Sopenharmony_ci   .. [1] For integer specifiers (d, u, ld, li, lu, lld, lli, llu, zd, zi,
5567db96d56Sopenharmony_ci      zu, i, x): the 0-conversion flag has effect even when a precision is given.
5577db96d56Sopenharmony_ci
5587db96d56Sopenharmony_ci   .. versionchanged:: 3.2
5597db96d56Sopenharmony_ci      Support for ``"%lld"`` and ``"%llu"`` added.
5607db96d56Sopenharmony_ci
5617db96d56Sopenharmony_ci   .. versionchanged:: 3.3
5627db96d56Sopenharmony_ci      Support for ``"%li"``, ``"%lli"`` and ``"%zi"`` added.
5637db96d56Sopenharmony_ci
5647db96d56Sopenharmony_ci   .. versionchanged:: 3.4
      Support for width and precision formatters for ``"%s"``, ``"%A"``,
      ``"%U"``, ``"%V"``, ``"%S"`` and ``"%R"`` added.
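
   As a rough illustration of the format characters above, the following sketch
   builds a short description string.  ``describe_item`` and its parameters are
   hypothetical names chosen for this example; they are not part of the API:

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper (not part of the C API). */
      static PyObject *
      describe_item(PyObject *name, Py_ssize_t count, int id)
      {
          /* %R inserts PyObject_Repr(name), %zd formats a Py_ssize_t and
             %x formats an int in hexadecimal. */
          return PyUnicode_FromFormat("item %R has %zd entries (id=0x%x)",
                                      name, count, id);
      }

   For a *name* of ``"spam"``, a *count* of 3 and an *id* of 42, this should
   produce ``item 'spam' has 3 entries (id=0x2a)``.  The result is a new
   reference, or ``NULL`` with an exception set on failure.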
5677db96d56Sopenharmony_ci
5687db96d56Sopenharmony_ci
5697db96d56Sopenharmony_ci.. c:function:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
5707db96d56Sopenharmony_ci
5717db96d56Sopenharmony_ci   Identical to :c:func:`PyUnicode_FromFormat` except that it takes exactly two
5727db96d56Sopenharmony_ci   arguments.
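
   This form is mainly useful for writing your own variadic wrappers.  A
   minimal sketch, assuming a hypothetical wrapper named ``format_message``
   that simply forwards its arguments:

   .. code-block:: c

      #include <Python.h>
      #include <stdarg.h>

      /* Hypothetical variadic wrapper (not part of the C API). */
      static PyObject *
      format_message(const char *format, ...)
      {
          va_list vargs;
          PyObject *result;

          va_start(vargs, format);
          /* The va_list stands in for the variable arguments. */
          result = PyUnicode_FromFormatV(format, vargs);
          va_end(vargs);
          return result;  /* new reference, or NULL with an exception set */
      }

   A call such as ``format_message("%s=%d", "x", 7)`` would then behave like
   the corresponding :c:func:`PyUnicode_FromFormat` call.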
5737db96d56Sopenharmony_ci
5747db96d56Sopenharmony_ci
5757db96d56Sopenharmony_ci.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj)
5767db96d56Sopenharmony_ci
   Copy an instance of a Unicode subtype to a new true Unicode object if
   necessary. If *obj* is already a true Unicode object (not a subtype),
   return a new reference to it.
5807db96d56Sopenharmony_ci
5817db96d56Sopenharmony_ci   Objects other than Unicode or its subtypes will cause a :exc:`TypeError`.
5827db96d56Sopenharmony_ci
5837db96d56Sopenharmony_ci
5847db96d56Sopenharmony_ci.. c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, \
5857db96d56Sopenharmony_ci                               const char *encoding, const char *errors)
5867db96d56Sopenharmony_ci
5877db96d56Sopenharmony_ci   Decode an encoded object *obj* to a Unicode object.
5887db96d56Sopenharmony_ci
5897db96d56Sopenharmony_ci   :class:`bytes`, :class:`bytearray` and other
5907db96d56Sopenharmony_ci   :term:`bytes-like objects <bytes-like object>`
5917db96d56Sopenharmony_ci   are decoded according to the given *encoding* and using the error handling
5927db96d56Sopenharmony_ci   defined by *errors*. Both can be ``NULL`` to have the interface use the default
5937db96d56Sopenharmony_ci   values (see :ref:`builtincodecs` for details).
5947db96d56Sopenharmony_ci
5957db96d56Sopenharmony_ci   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
5967db96d56Sopenharmony_ci   set.
5977db96d56Sopenharmony_ci
   The API returns ``NULL`` if there was an error.  The caller owns the returned
   reference and is responsible for releasing it.
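
   The following sketch shows one possible use.  ``decode_latin1`` is a
   hypothetical helper, and the choice of the ``"latin-1"`` codec with the
   ``"strict"`` error handler is only an assumption made for the example:

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper (not part of the C API). */
      static PyObject *
      decode_latin1(PyObject *obj)
      {
          /* bytes-like objects are decoded; any other object, including
             str, makes the call fail with TypeError and return NULL. */
          return PyUnicode_FromEncodedObject(obj, "latin-1", "strict");
      }

   Callers should check the result for ``NULL`` before using it.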
6007db96d56Sopenharmony_ci
6017db96d56Sopenharmony_ci
6027db96d56Sopenharmony_ci.. c:function:: Py_ssize_t PyUnicode_GetLength(PyObject *unicode)
6037db96d56Sopenharmony_ci
6047db96d56Sopenharmony_ci   Return the length of the Unicode object, in code points.
6057db96d56Sopenharmony_ci
6067db96d56Sopenharmony_ci   .. versionadded:: 3.3
6077db96d56Sopenharmony_ci
6087db96d56Sopenharmony_ci
6097db96d56Sopenharmony_ci.. c:function:: Py_ssize_t PyUnicode_CopyCharacters(PyObject *to, \
6107db96d56Sopenharmony_ci                                                    Py_ssize_t to_start, \
6117db96d56Sopenharmony_ci                                                    PyObject *from, \
6127db96d56Sopenharmony_ci                                                    Py_ssize_t from_start, \
6137db96d56Sopenharmony_ci                                                    Py_ssize_t how_many)
6147db96d56Sopenharmony_ci
6157db96d56Sopenharmony_ci   Copy characters from one Unicode object into another.  This function performs
6167db96d56Sopenharmony_ci   character conversion when necessary and falls back to :c:func:`memcpy` if
6177db96d56Sopenharmony_ci   possible.  Returns ``-1`` and sets an exception on error, otherwise returns
6187db96d56Sopenharmony_ci   the number of copied characters.
6197db96d56Sopenharmony_ci
6207db96d56Sopenharmony_ci   .. versionadded:: 3.3
6217db96d56Sopenharmony_ci
6227db96d56Sopenharmony_ci
6237db96d56Sopenharmony_ci.. c:function:: Py_ssize_t PyUnicode_Fill(PyObject *unicode, Py_ssize_t start, \
6247db96d56Sopenharmony_ci                        Py_ssize_t length, Py_UCS4 fill_char)
6257db96d56Sopenharmony_ci
6267db96d56Sopenharmony_ci   Fill a string with a character: write *fill_char* into
6277db96d56Sopenharmony_ci   ``unicode[start:start+length]``.
6287db96d56Sopenharmony_ci
   Fail if *fill_char* is larger than the string's maximum character, or if the
   string has more than one reference.

   Return the number of characters written, or return ``-1`` and raise an
   exception on error.
6347db96d56Sopenharmony_ci
6357db96d56Sopenharmony_ci   .. versionadded:: 3.3
6367db96d56Sopenharmony_ci
6377db96d56Sopenharmony_ci
6387db96d56Sopenharmony_ci.. c:function:: int PyUnicode_WriteChar(PyObject *unicode, Py_ssize_t index, \
6397db96d56Sopenharmony_ci                                        Py_UCS4 character)
6407db96d56Sopenharmony_ci
6417db96d56Sopenharmony_ci   Write a character to a string.  The string must have been created through
6427db96d56Sopenharmony_ci   :c:func:`PyUnicode_New`.  Since Unicode strings are supposed to be immutable,
6437db96d56Sopenharmony_ci   the string must not be shared, or have been hashed yet.
6447db96d56Sopenharmony_ci
   This function checks that *unicode* is a Unicode object, that the index is
   not out of bounds, and that the object can be modified safely (i.e. that
   its reference count is one).
6487db96d56Sopenharmony_ci
6497db96d56Sopenharmony_ci   .. versionadded:: 3.3
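
   A minimal sketch of how :c:func:`PyUnicode_New`, :c:func:`PyUnicode_Fill`
   and :c:func:`PyUnicode_WriteChar` might be combined to build a fresh string
   before it is shared.  ``make_banner`` is a hypothetical helper, and the
   maximum character of 127 assumes that only ASCII is written:

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: returns e.g. "[---]" for a width of 5. */
      static PyObject *
      make_banner(Py_ssize_t width)
      {
          /* New, unshared string that can hold code points up to 127. */
          PyObject *s = PyUnicode_New(width, 127);
          if (s == NULL) {
              return NULL;
          }
          /* The empty string may be a shared object, so do not write to it. */
          if (width > 0) {
              if (PyUnicode_Fill(s, 0, width, '-') < 0
                  || PyUnicode_WriteChar(s, 0, '[') < 0
                  || PyUnicode_WriteChar(s, width - 1, ']') < 0) {
                  Py_DECREF(s);
                  return NULL;
              }
          }
          return s;
      }

   All writes happen before the string is returned to any other code, which is
   what keeps the modification safe.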
6507db96d56Sopenharmony_ci
6517db96d56Sopenharmony_ci
6527db96d56Sopenharmony_ci.. c:function:: Py_UCS4 PyUnicode_ReadChar(PyObject *unicode, Py_ssize_t index)
6537db96d56Sopenharmony_ci
6547db96d56Sopenharmony_ci   Read a character from a string.  This function checks that *unicode* is a
6557db96d56Sopenharmony_ci   Unicode object and the index is not out of bounds, in contrast to
6567db96d56Sopenharmony_ci   :c:func:`PyUnicode_READ_CHAR`, which performs no error checking.
6577db96d56Sopenharmony_ci
6587db96d56Sopenharmony_ci   .. versionadded:: 3.3
6597db96d56Sopenharmony_ci
6607db96d56Sopenharmony_ci
6617db96d56Sopenharmony_ci.. c:function:: PyObject* PyUnicode_Substring(PyObject *str, Py_ssize_t start, \
6627db96d56Sopenharmony_ci                                              Py_ssize_t end)
6637db96d56Sopenharmony_ci
6647db96d56Sopenharmony_ci   Return a substring of *str*, from character index *start* (included) to
6657db96d56Sopenharmony_ci   character index *end* (excluded).  Negative indices are not supported.
6667db96d56Sopenharmony_ci
6677db96d56Sopenharmony_ci   .. versionadded:: 3.3
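
   Because negative indices are not supported, offsets relative to the end have
   to be computed from the length.  A small sketch, using a hypothetical helper
   ``strip_ends`` that drops the first and last character:

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper (not part of the C API). */
      static PyObject *
      strip_ends(PyObject *str)
      {
          Py_ssize_t length = PyUnicode_GetLength(str);
          if (length < 0) {
              return NULL;  /* not a Unicode object; exception already set */
          }
          if (length < 2) {
              /* Nothing would remain: return an empty string. */
              return PyUnicode_Substring(str, 0, 0);
          }
          /* Equivalent to the Python slice str[1:-1]. */
          return PyUnicode_Substring(str, 1, length - 1);
      }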
6687db96d56Sopenharmony_ci
6697db96d56Sopenharmony_ci
6707db96d56Sopenharmony_ci.. c:function:: Py_UCS4* PyUnicode_AsUCS4(PyObject *u, Py_UCS4 *buffer, \
6717db96d56Sopenharmony_ci                                          Py_ssize_t buflen, int copy_null)
6727db96d56Sopenharmony_ci
6737db96d56Sopenharmony_ci   Copy the string *u* into a UCS4 buffer, including a null character, if
6747db96d56Sopenharmony_ci   *copy_null* is set.  Returns ``NULL`` and sets an exception on error (in
6757db96d56Sopenharmony_ci   particular, a :exc:`SystemError` if *buflen* is smaller than the length of
6767db96d56Sopenharmony_ci   *u*).  *buffer* is returned on success.
6777db96d56Sopenharmony_ci
6787db96d56Sopenharmony_ci   .. versionadded:: 3.3
6797db96d56Sopenharmony_ci
6807db96d56Sopenharmony_ci
6817db96d56Sopenharmony_ci.. c:function:: Py_UCS4* PyUnicode_AsUCS4Copy(PyObject *u)
6827db96d56Sopenharmony_ci
6837db96d56Sopenharmony_ci   Copy the string *u* into a new UCS4 buffer that is allocated using
6847db96d56Sopenharmony_ci   :c:func:`PyMem_Malloc`.  If this fails, ``NULL`` is returned with a
6857db96d56Sopenharmony_ci   :exc:`MemoryError` set.  The returned buffer always has an extra
6867db96d56Sopenharmony_ci   null code point appended.
6877db96d56Sopenharmony_ci
6887db96d56Sopenharmony_ci   .. versionadded:: 3.3
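
   The returned buffer belongs to the caller and should be released with
   :c:func:`PyMem_Free`.  A minimal sketch, assuming a hypothetical helper
   ``count_non_bmp`` that counts code points above U+FFFF:

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: returns the count, or -1 with an exception set. */
      static Py_ssize_t
      count_non_bmp(PyObject *u)
      {
          Py_ssize_t i, count = 0;
          Py_ssize_t length = PyUnicode_GetLength(u);
          Py_UCS4 *buffer;

          if (length < 0) {
              return -1;
          }
          buffer = PyUnicode_AsUCS4Copy(u);
          if (buffer == NULL) {
              return -1;
          }
          /* Iterate by length: the string may contain embedded nulls. */
          for (i = 0; i < length; i++) {
              if (buffer[i] > 0xFFFF) {
                  count++;
              }
          }
          PyMem_Free(buffer);
          return count;
      }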
6897db96d56Sopenharmony_ci
6907db96d56Sopenharmony_ci
6917db96d56Sopenharmony_ciDeprecated Py_UNICODE APIs
6927db96d56Sopenharmony_ci""""""""""""""""""""""""""
6937db96d56Sopenharmony_ci
6947db96d56Sopenharmony_ci.. deprecated-removed:: 3.3 3.12
6957db96d56Sopenharmony_ci
These API functions are deprecated with the implementation of :pep:`393`.
Extension modules can continue using them until they are removed in Python
3.12, but need to be aware that their use can now incur performance and memory
overhead.
6997db96d56Sopenharmony_ci
7007db96d56Sopenharmony_ci
7017db96d56Sopenharmony_ci.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
7027db96d56Sopenharmony_ci
7037db96d56Sopenharmony_ci   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
7047db96d56Sopenharmony_ci   may be ``NULL`` which causes the contents to be undefined. It is the user's
7057db96d56Sopenharmony_ci   responsibility to fill in the needed data.  The buffer is copied into the new
7067db96d56Sopenharmony_ci   object.
7077db96d56Sopenharmony_ci
7087db96d56Sopenharmony_ci   If the buffer is not ``NULL``, the return value might be a shared object.
7097db96d56Sopenharmony_ci   Therefore, modification of the resulting Unicode object is only allowed when
7107db96d56Sopenharmony_ci   *u* is ``NULL``.
7117db96d56Sopenharmony_ci
7127db96d56Sopenharmony_ci   If the buffer is ``NULL``, :c:func:`PyUnicode_READY` must be called once the
7137db96d56Sopenharmony_ci   string content has been filled before using any of the access macros such as
7147db96d56Sopenharmony_ci   :c:func:`PyUnicode_KIND`.
7157db96d56Sopenharmony_ci
7167db96d56Sopenharmony_ci   .. deprecated-removed:: 3.3 3.12
7177db96d56Sopenharmony_ci      Part of the old-style Unicode API, please migrate to using
7187db96d56Sopenharmony_ci      :c:func:`PyUnicode_FromKindAndData`, :c:func:`PyUnicode_FromWideChar`, or
7197db96d56Sopenharmony_ci      :c:func:`PyUnicode_New`.
7207db96d56Sopenharmony_ci
7217db96d56Sopenharmony_ci
7227db96d56Sopenharmony_ci.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
7237db96d56Sopenharmony_ci
7247db96d56Sopenharmony_ci   Return a read-only pointer to the Unicode object's internal
7257db96d56Sopenharmony_ci   :c:type:`Py_UNICODE` buffer, or ``NULL`` on error. This will create the
7267db96d56Sopenharmony_ci   :c:expr:`Py_UNICODE*` representation of the object if it is not yet
7277db96d56Sopenharmony_ci   available. The buffer is always terminated with an extra null code point.
7287db96d56Sopenharmony_ci   Note that the resulting :c:type:`Py_UNICODE` string may also contain
7297db96d56Sopenharmony_ci   embedded null code points, which would cause the string to be truncated when
7307db96d56Sopenharmony_ci   used in most C functions.
7317db96d56Sopenharmony_ci
7327db96d56Sopenharmony_ci   .. deprecated-removed:: 3.3 3.12
7337db96d56Sopenharmony_ci      Part of the old-style Unicode API, please migrate to using
7347db96d56Sopenharmony_ci      :c:func:`PyUnicode_AsUCS4`, :c:func:`PyUnicode_AsWideChar`,
7357db96d56Sopenharmony_ci      :c:func:`PyUnicode_ReadChar` or similar new APIs.
7367db96d56Sopenharmony_ci
7377db96d56Sopenharmony_ci
7387db96d56Sopenharmony_ci.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size)
7397db96d56Sopenharmony_ci
   Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:type:`Py_UNICODE`
   array length (excluding the extra null terminator) in *size*.
7427db96d56Sopenharmony_ci   Note that the resulting :c:expr:`Py_UNICODE*` string
7437db96d56Sopenharmony_ci   may contain embedded null code points, which would cause the string to be
7447db96d56Sopenharmony_ci   truncated when used in most C functions.
7457db96d56Sopenharmony_ci
7467db96d56Sopenharmony_ci   .. versionadded:: 3.3
7477db96d56Sopenharmony_ci
7487db96d56Sopenharmony_ci   .. deprecated-removed:: 3.3 3.12
7497db96d56Sopenharmony_ci      Part of the old-style Unicode API, please migrate to using
7507db96d56Sopenharmony_ci      :c:func:`PyUnicode_AsUCS4`, :c:func:`PyUnicode_AsWideChar`,
7517db96d56Sopenharmony_ci      :c:func:`PyUnicode_ReadChar` or similar new APIs.
7527db96d56Sopenharmony_ci
7537db96d56Sopenharmony_ci
7547db96d56Sopenharmony_ci.. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
7557db96d56Sopenharmony_ci
7567db96d56Sopenharmony_ci   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
7577db96d56Sopenharmony_ci   code units (this includes surrogate pairs as 2 units).
7587db96d56Sopenharmony_ci
7597db96d56Sopenharmony_ci   .. deprecated-removed:: 3.3 3.12
7607db96d56Sopenharmony_ci      Part of the old-style Unicode API, please migrate to using
7617db96d56Sopenharmony_ci      :c:func:`PyUnicode_GET_LENGTH`.
7627db96d56Sopenharmony_ci
7637db96d56Sopenharmony_ci
7647db96d56Sopenharmony_ciLocale Encoding
7657db96d56Sopenharmony_ci"""""""""""""""
7667db96d56Sopenharmony_ci
7677db96d56Sopenharmony_ciThe current locale encoding can be used to decode text from the operating
7687db96d56Sopenharmony_cisystem.
7697db96d56Sopenharmony_ci
7707db96d56Sopenharmony_ci.. c:function:: PyObject* PyUnicode_DecodeLocaleAndSize(const char *str, \
7717db96d56Sopenharmony_ci                                                        Py_ssize_t len, \
7727db96d56Sopenharmony_ci                                                        const char *errors)
7737db96d56Sopenharmony_ci
7747db96d56Sopenharmony_ci   Decode a string from UTF-8 on Android and VxWorks, or from the current
7757db96d56Sopenharmony_ci   locale encoding on other platforms. The supported
7767db96d56Sopenharmony_ci   error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The decoder uses the ``"strict"`` error handler if
   *errors* is ``NULL``.  *str* must end with a null character but
   cannot contain embedded null characters.
7807db96d56Sopenharmony_ci
7817db96d56Sopenharmony_ci   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` to decode a string from
7827db96d56Sopenharmony_ci   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
7837db96d56Sopenharmony_ci   Python startup).
7847db96d56Sopenharmony_ci
7857db96d56Sopenharmony_ci   This function ignores the :ref:`Python UTF-8 Mode <utf8-mode>`.
7867db96d56Sopenharmony_ci
7877db96d56Sopenharmony_ci   .. seealso::
7887db96d56Sopenharmony_ci
7897db96d56Sopenharmony_ci      The :c:func:`Py_DecodeLocale` function.
7907db96d56Sopenharmony_ci
7917db96d56Sopenharmony_ci   .. versionadded:: 3.3
7927db96d56Sopenharmony_ci
7937db96d56Sopenharmony_ci   .. versionchanged:: 3.7
      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously,
      :c:func:`Py_DecodeLocale` was used for ``surrogateescape``, and the
      current locale encoding was used for ``strict``.
7987db96d56Sopenharmony_ci
7997db96d56Sopenharmony_ci
8007db96d56Sopenharmony_ci.. c:function:: PyObject* PyUnicode_DecodeLocale(const char *str, const char *errors)
8017db96d56Sopenharmony_ci
8027db96d56Sopenharmony_ci   Similar to :c:func:`PyUnicode_DecodeLocaleAndSize`, but compute the string
8037db96d56Sopenharmony_ci   length using :c:func:`strlen`.
8047db96d56Sopenharmony_ci
8057db96d56Sopenharmony_ci   .. versionadded:: 3.3
8067db96d56Sopenharmony_ci
8077db96d56Sopenharmony_ci
8087db96d56Sopenharmony_ci.. c:function:: PyObject* PyUnicode_EncodeLocale(PyObject *unicode, const char *errors)
8097db96d56Sopenharmony_ci
8107db96d56Sopenharmony_ci   Encode a Unicode object to UTF-8 on Android and VxWorks, or to the current
8117db96d56Sopenharmony_ci   locale encoding on other platforms. The
8127db96d56Sopenharmony_ci   supported error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The encoder uses the ``"strict"`` error handler if
   *errors* is ``NULL``. Return a :class:`bytes` object. *unicode* cannot
   contain embedded null characters.
8167db96d56Sopenharmony_ci
8177db96d56Sopenharmony_ci   Use :c:func:`PyUnicode_EncodeFSDefault` to encode a string to
8187db96d56Sopenharmony_ci   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
8197db96d56Sopenharmony_ci   Python startup).
8207db96d56Sopenharmony_ci
8217db96d56Sopenharmony_ci   This function ignores the :ref:`Python UTF-8 Mode <utf8-mode>`.
8227db96d56Sopenharmony_ci
8237db96d56Sopenharmony_ci   .. seealso::
8247db96d56Sopenharmony_ci
8257db96d56Sopenharmony_ci      The :c:func:`Py_EncodeLocale` function.
8267db96d56Sopenharmony_ci
8277db96d56Sopenharmony_ci   .. versionadded:: 3.3
8287db96d56Sopenharmony_ci
8297db96d56Sopenharmony_ci   .. versionchanged:: 3.7
      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously,
      :c:func:`Py_EncodeLocale` was used for ``surrogateescape``, and the
      current locale encoding was used for ``strict``.
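
   As an illustration, the sketch below encodes a string to the locale encoding
   and decodes it again.  ``locale_roundtrip`` is a hypothetical helper, and
   using ``"surrogateescape"`` for both directions is an assumption made for
   the example:

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper (not part of the C API). */
      static PyObject *
      locale_roundtrip(PyObject *unicode)
      {
          PyObject *encoded, *decoded;

          encoded = PyUnicode_EncodeLocale(unicode, "surrogateescape");
          if (encoded == NULL) {
              return NULL;
          }
          /* A bytes object's buffer is null-terminated, as the decoder
             requires. */
          decoded = PyUnicode_DecodeLocaleAndSize(PyBytes_AS_STRING(encoded),
                                                  PyBytes_GET_SIZE(encoded),
                                                  "surrogateescape");
          Py_DECREF(encoded);
          return decoded;  /* new reference, or NULL with an exception set */
      }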
8357db96d56Sopenharmony_ci
8367db96d56Sopenharmony_ci
8377db96d56Sopenharmony_ciFile System Encoding
8387db96d56Sopenharmony_ci""""""""""""""""""""
8397db96d56Sopenharmony_ci
8407db96d56Sopenharmony_ciTo encode and decode file names and other environment strings,
8417db96d56Sopenharmony_ci:c:data:`Py_FileSystemDefaultEncoding` should be used as the encoding, and
8427db96d56Sopenharmony_ci:c:data:`Py_FileSystemDefaultEncodeErrors` should be used as the error handler
8437db96d56Sopenharmony_ci(:pep:`383` and :pep:`529`). To encode file names to :class:`bytes` during
8447db96d56Sopenharmony_ciargument parsing, the ``"O&"`` converter should be used, passing
8457db96d56Sopenharmony_ci:c:func:`PyUnicode_FSConverter` as the conversion function:
8467db96d56Sopenharmony_ci
8477db96d56Sopenharmony_ci.. c:function:: int PyUnicode_FSConverter(PyObject* obj, void* result)
8487db96d56Sopenharmony_ci
8497db96d56Sopenharmony_ci   ParseTuple converter: encode :class:`str` objects -- obtained directly or
8507db96d56Sopenharmony_ci   through the :class:`os.PathLike` interface -- to :class:`bytes` using
8517db96d56Sopenharmony_ci   :c:func:`PyUnicode_EncodeFSDefault`; :class:`bytes` objects are output as-is.
8527db96d56Sopenharmony_ci   *result* must be a :c:expr:`PyBytesObject*` which must be released when it is
8537db96d56Sopenharmony_ci   no longer used.
8547db96d56Sopenharmony_ci
8557db96d56Sopenharmony_ci   .. versionadded:: 3.1
8567db96d56Sopenharmony_ci
8577db96d56Sopenharmony_ci   .. versionchanged:: 3.6
8587db96d56Sopenharmony_ci      Accepts a :term:`path-like object`.
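
   A minimal sketch of the ``"O&"`` usage described above.  ``path_length`` is
   a hypothetical module-level function that returns the length in bytes of
   the encoded file name:

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical extension function (not part of the C API). */
      static PyObject *
      path_length(PyObject *self, PyObject *args)
      {
          PyObject *path_bytes = NULL;
          Py_ssize_t size;

          /* The converter stores a new reference to a bytes object in
             path_bytes. */
          if (!PyArg_ParseTuple(args, "O&",
                                PyUnicode_FSConverter, &path_bytes)) {
              return NULL;
          }
          size = PyBytes_GET_SIZE(path_bytes);
          Py_DECREF(path_bytes);  /* release it once it is no longer used */
          return PyLong_FromSsize_t(size);
      }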
8597db96d56Sopenharmony_ci
8607db96d56Sopenharmony_ciTo decode file names to :class:`str` during argument parsing, the ``"O&"``
8617db96d56Sopenharmony_ciconverter should be used, passing :c:func:`PyUnicode_FSDecoder` as the
8627db96d56Sopenharmony_ciconversion function:
8637db96d56Sopenharmony_ci
8647db96d56Sopenharmony_ci.. c:function:: int PyUnicode_FSDecoder(PyObject* obj, void* result)
8657db96d56Sopenharmony_ci
8667db96d56Sopenharmony_ci   ParseTuple converter: decode :class:`bytes` objects -- obtained either
8677db96d56Sopenharmony_ci   directly or indirectly through the :class:`os.PathLike` interface -- to
8687db96d56Sopenharmony_ci   :class:`str` using :c:func:`PyUnicode_DecodeFSDefaultAndSize`; :class:`str`
8697db96d56Sopenharmony_ci   objects are output as-is. *result* must be a :c:expr:`PyUnicodeObject*` which
8707db96d56Sopenharmony_ci   must be released when it is no longer used.
8717db96d56Sopenharmony_ci
8727db96d56Sopenharmony_ci   .. versionadded:: 3.2
8737db96d56Sopenharmony_ci
8747db96d56Sopenharmony_ci   .. versionchanged:: 3.6
8757db96d56Sopenharmony_ci      Accepts a :term:`path-like object`.
8767db96d56Sopenharmony_ci
8777db96d56Sopenharmony_ci
8787db96d56Sopenharmony_ci.. c:function:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)
8797db96d56Sopenharmony_ci
8807db96d56Sopenharmony_ci   Decode a string from the :term:`filesystem encoding and error handler`.
8817db96d56Sopenharmony_ci
8827db96d56Sopenharmony_ci   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
8837db96d56Sopenharmony_ci   locale encoding.
8847db96d56Sopenharmony_ci
8857db96d56Sopenharmony_ci   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
8867db96d56Sopenharmony_ci   locale encoding and cannot be modified later. If you need to decode a string
8877db96d56Sopenharmony_ci   from the current locale encoding, use
8887db96d56Sopenharmony_ci   :c:func:`PyUnicode_DecodeLocaleAndSize`.
8897db96d56Sopenharmony_ci
8907db96d56Sopenharmony_ci   .. seealso::
8917db96d56Sopenharmony_ci
8927db96d56Sopenharmony_ci      The :c:func:`Py_DecodeLocale` function.
8937db96d56Sopenharmony_ci
8947db96d56Sopenharmony_ci   .. versionchanged:: 3.6
8957db96d56Sopenharmony_ci      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
8967db96d56Sopenharmony_ci
8977db96d56Sopenharmony_ci
8987db96d56Sopenharmony_ci.. c:function:: PyObject* PyUnicode_DecodeFSDefault(const char *s)
8997db96d56Sopenharmony_ci
9007db96d56Sopenharmony_ci   Decode a null-terminated string from the :term:`filesystem encoding and
9017db96d56Sopenharmony_ci   error handler`.
9027db96d56Sopenharmony_ci
9037db96d56Sopenharmony_ci   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
9047db96d56Sopenharmony_ci   locale encoding.
9057db96d56Sopenharmony_ci
9067db96d56Sopenharmony_ci   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length.
9077db96d56Sopenharmony_ci
9087db96d56Sopenharmony_ci   .. versionchanged:: 3.6
9097db96d56Sopenharmony_ci      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
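
   For example, an environment string obtained from the C library could be
   turned into a :class:`str` as sketched below.  ``env_as_str`` is a
   hypothetical helper, and returning ``None`` for an unset variable is a
   design choice of this example only:

   .. code-block:: c

      #include <Python.h>
      #include <stdlib.h>

      /* Hypothetical helper (not part of the C API). */
      static PyObject *
      env_as_str(const char *name)
      {
          const char *value = getenv(name);
          if (value == NULL) {
              Py_RETURN_NONE;  /* variable is not set */
          }
          /* value is null-terminated, so the length is not needed. */
          return PyUnicode_DecodeFSDefault(value);
      }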
9107db96d56Sopenharmony_ci
9117db96d56Sopenharmony_ci
9127db96d56Sopenharmony_ci.. c:function:: PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode)
9137db96d56Sopenharmony_ci
9147db96d56Sopenharmony_ci   Encode a Unicode object to :c:data:`Py_FileSystemDefaultEncoding` with the
9157db96d56Sopenharmony_ci   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler, and return
9167db96d56Sopenharmony_ci   :class:`bytes`. Note that the resulting :class:`bytes` object may contain
9177db96d56Sopenharmony_ci   null bytes.
9187db96d56Sopenharmony_ci
9197db96d56Sopenharmony_ci   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
9207db96d56Sopenharmony_ci   locale encoding.
9217db96d56Sopenharmony_ci
9227db96d56Sopenharmony_ci   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
9237db96d56Sopenharmony_ci   locale encoding and cannot be modified later. If you need to encode a string
9247db96d56Sopenharmony_ci   to the current locale encoding, use :c:func:`PyUnicode_EncodeLocale`.
9257db96d56Sopenharmony_ci
9267db96d56Sopenharmony_ci   .. seealso::
9277db96d56Sopenharmony_ci
9287db96d56Sopenharmony_ci      The :c:func:`Py_EncodeLocale` function.
9297db96d56Sopenharmony_ci
9307db96d56Sopenharmony_ci   .. versionadded:: 3.2
9317db96d56Sopenharmony_ci
9327db96d56Sopenharmony_ci   .. versionchanged:: 3.6
9337db96d56Sopenharmony_ci      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
9347db96d56Sopenharmony_ci
9357db96d56Sopenharmony_ciwchar_t Support
9367db96d56Sopenharmony_ci"""""""""""""""
9377db96d56Sopenharmony_ci
9387db96d56Sopenharmony_ci:c:expr:`wchar_t` support for platforms which support it:
9397db96d56Sopenharmony_ci
9407db96d56Sopenharmony_ci.. c:function:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
9417db96d56Sopenharmony_ci
9427db96d56Sopenharmony_ci   Create a Unicode object from the :c:expr:`wchar_t` buffer *w* of the given *size*.
   Passing ``-1`` as the *size* indicates that the function must itself compute the length,
   using :c:func:`wcslen`.
   Return ``NULL`` on failure.
9467db96d56Sopenharmony_ci
9477db96d56Sopenharmony_ci
9487db96d56Sopenharmony_ci.. c:function:: Py_ssize_t PyUnicode_AsWideChar(PyObject *unicode, wchar_t *w, Py_ssize_t size)
9497db96d56Sopenharmony_ci
9507db96d56Sopenharmony_ci   Copy the Unicode object contents into the :c:expr:`wchar_t` buffer *w*.  At most
9517db96d56Sopenharmony_ci   *size* :c:expr:`wchar_t` characters are copied (excluding a possibly trailing
9527db96d56Sopenharmony_ci   null termination character).  Return the number of :c:expr:`wchar_t` characters
9537db96d56Sopenharmony_ci   copied or ``-1`` in case of an error.  Note that the resulting :c:expr:`wchar_t*`
9547db96d56Sopenharmony_ci   string may or may not be null-terminated.  It is the responsibility of the caller
9557db96d56Sopenharmony_ci   to make sure that the :c:expr:`wchar_t*` string is null-terminated in case this is
9567db96d56Sopenharmony_ci   required by the application. Also, note that the :c:expr:`wchar_t*` string
9577db96d56Sopenharmony_ci   might contain null characters, which would cause the string to be truncated
9587db96d56Sopenharmony_ci   when used with most C functions.
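
   One way to guarantee null termination is to reserve the last slot of the
   buffer for the terminator.  A minimal sketch, assuming a hypothetical helper
   ``copy_to_wide`` and a caller-supplied buffer of *buflen* elements
   (*buflen* must be at least 1):

   .. code-block:: c

      #include <Python.h>
      #include <wchar.h>

      /* Hypothetical helper: returns 0 on success, -1 with an exception set.
         The copy is truncated if the buffer is too small. */
      static int
      copy_to_wide(PyObject *unicode, wchar_t *buf, Py_ssize_t buflen)
      {
          /* Copy at most buflen - 1 characters so one slot stays free. */
          Py_ssize_t copied = PyUnicode_AsWideChar(unicode, buf, buflen - 1);
          if (copied < 0) {
              return -1;
          }
          buf[copied] = L'\0';  /* terminate explicitly */
          return 0;
      }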
9597db96d56Sopenharmony_ci
9607db96d56Sopenharmony_ci
9617db96d56Sopenharmony_ci.. c:function:: wchar_t* PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)
9627db96d56Sopenharmony_ci
9637db96d56Sopenharmony_ci   Convert the Unicode object to a wide character string. The output string
9647db96d56Sopenharmony_ci   always ends with a null character. If *size* is not ``NULL``, write the number
9657db96d56Sopenharmony_ci   of wide characters (excluding the trailing null termination character) into
9667db96d56Sopenharmony_ci   *\*size*. Note that the resulting :c:expr:`wchar_t` string might contain
9677db96d56Sopenharmony_ci   null characters, which would cause the string to be truncated when used with
9687db96d56Sopenharmony_ci   most C functions. If *size* is ``NULL`` and the :c:expr:`wchar_t*` string
9697db96d56Sopenharmony_ci   contains null characters a :exc:`ValueError` is raised.
9707db96d56Sopenharmony_ci
   Returns a buffer allocated by :c:func:`PyMem_New` (use
   :c:func:`PyMem_Free` to free it) on success. On error, returns ``NULL``
   and *\*size* is undefined. Raises a :exc:`MemoryError` if memory allocation
   fails.
9757db96d56Sopenharmony_ci
9767db96d56Sopenharmony_ci   .. versionadded:: 3.2
9777db96d56Sopenharmony_ci
9787db96d56Sopenharmony_ci   .. versionchanged:: 3.7
9797db96d56Sopenharmony_ci      Raises a :exc:`ValueError` if *size* is ``NULL`` and the :c:expr:`wchar_t*`
9807db96d56Sopenharmony_ci      string contains null characters.
9817db96d56Sopenharmony_ci

.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*, which
have the same semantics as the parameters of the same name of the built-in
:func:`str` string object constructor.

Setting *encoding* to ``NULL`` causes the default encoding, UTF-8, to be used.
File system calls should use :c:func:`PyUnicode_FSConverter` for encoding
file names. This uses the variable :c:data:`Py_FileSystemDefaultEncoding`
internally. This variable should be treated as read-only: on some systems, it
will be a pointer to a static string, on others, it will change at run-time
(such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to ``NULL``, meaning to
use the default handling defined for the codec.  Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface.  For simplicity, only deviations from
the generic APIs below are documented.


Generic Codecs
""""""""""""""

These are the generic codec APIs:


.. c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, \
                              const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`str` built-in function.  The codec to be used is looked up
   using the Python codec registry.  Return ``NULL`` if an exception was raised by
   the codec.


.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, \
                              const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python bytes object.
   *encoding* and *errors* have the same meaning as the parameters of the same
   name in the Unicode :meth:`~str.encode` method. The codec to be used is looked up
   using the Python codec registry. Return ``NULL`` if an exception was raised by
   the codec.


UTF-8 Codecs
""""""""""""

These are the UTF-8 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF8`. If
   *consumed* is not ``NULL``, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: const char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)

   Return a pointer to the UTF-8 encoding of the Unicode object, and
   store the size of the encoded representation (in bytes) in *size*.  The
   *size* argument can be ``NULL``; in this case no size will be stored.  The
   returned buffer always has an extra null byte appended (not included in
   *size*), regardless of whether there are any other null code points.

   In the case of an error, ``NULL`` is returned with an exception set and no
   *size* is stored.

   This caches the UTF-8 representation of the string in the Unicode object, and
   subsequent calls will return a pointer to the same buffer.  The caller is not
   responsible for deallocating the buffer. The buffer is deallocated and
   pointers to it become invalid when the Unicode object is garbage collected.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.

   .. versionchanged:: 3.10
      This function is a part of the :ref:`limited API <stable>`.


.. c:function:: const char* PyUnicode_AsUTF8(PyObject *unicode)

   As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.


UTF-32 Codecs
"""""""""""""

These are the UTF-32 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object.  *errors* (if non-``NULL``) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first four bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output.

   After completion, ``*byteorder`` is set to the current byte order at the end
   of input data.

   If *byteorder* is ``NULL``, the codec starts in native order mode.

   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF32`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python bytes object using the UTF-32 encoding in native byte
   order. The string always starts with a BOM.  Error handling is "strict".
   Return ``NULL`` if an exception was raised by the codec.


UTF-16 Codecs
"""""""""""""

These are the UTF-16 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object.  *errors* (if non-``NULL``) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first two bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output (where it will result in
   either a ``\ufeff`` or a ``\ufffe`` character).

   After completion, ``*byteorder`` is set to the current byte order at the end
   of input data.

   If *byteorder* is ``NULL``, the codec starts in native order mode.

   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF16`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python bytes object using the UTF-16 encoding in native byte
   order. The string always starts with a BOM.  Error handling is "strict".
   Return ``NULL`` if an exception was raised by the codec.


UTF-7 Codecs
""""""""""""

These are the UTF-7 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
   *s*.  Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF7`.  If
   *consumed* is not ``NULL``, trailing incomplete UTF-7 base-64 sections will not
   be treated as an error.  Those bytes will not be decoded and the number of
   bytes that have been decoded will be stored in *consumed*.


Unicode-Escape Codecs
"""""""""""""""""""""

These are the "Unicode Escape" codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, \
                              Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*.  Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as a
   bytes object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


Raw-Unicode-Escape Codecs
"""""""""""""""""""""""""

These are the "Raw Unicode Escape" codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, \
                              Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*.  Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as
   a bytes object.  Error handling is "strict".  Return ``NULL`` if an exception
   was raised by the codec.


Latin-1 Codecs
""""""""""""""

These are the Latin-1 codec APIs.  Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.


.. c:function:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*.  Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


ASCII Codecs
""""""""""""

These are the ASCII codec APIs.  Only 7-bit ASCII data is accepted. All other
codes generate errors.


.. c:function:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*.  Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


Character Map Codecs
""""""""""""""""""""

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.  The mapping objects provided must support the
:meth:`__getitem__` mapping interface; dictionaries and sequences work well.

These are the mapping codec APIs:

.. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *data, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *data*
   using the given *mapping* object.  Return ``NULL`` if an exception was raised
   by the codec.

   If *mapping* is ``NULL``, Latin-1 decoding will be applied.  Otherwise,
   *mapping* must map bytes ordinals (integers in the range from 0 to 255)
   to Unicode strings, integers (which are then interpreted as Unicode
   ordinals) or ``None``.  Unmapped data bytes -- ones which cause a
   :exc:`LookupError`, as well as ones which get mapped to ``None``,
   ``0xFFFE`` or ``'\ufffe'`` -- are treated as undefined mappings and cause
   an error.


.. c:function:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the
   result as a bytes object.  Error handling is "strict".  Return ``NULL`` if an
   exception was raised by the codec.

   The *mapping* object must map Unicode ordinal integers to bytes objects,
   integers in the range from 0 to 255 or ``None``.  Unmapped character
   ordinals (ones which cause a :exc:`LookupError`) as well as ones mapped to
   ``None`` are treated as "undefined mapping" and cause an error.


The following codec API is special in that it maps Unicode to Unicode.

.. c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object. Return ``NULL`` if an exception was raised by the
   codec.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well.  Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be ``NULL``, in which case
   the default error handling is used.


MBCS codecs for Windows
"""""""""""""""""""""""

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions.  Note that MBCS (or
DBCS) is a class of encodings, not just one.  The target encoding is defined by
the user settings on the machine running the codec.

.. c:function:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeMBCS`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte and the number of bytes that have been decoded will be stored
   in *consumed*.


.. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeCodePage(int code_page, PyObject *unicode, const char *errors)

   Encode the Unicode object using the specified code page and return a Python
   bytes object.  Return ``NULL`` if an exception was raised by the codec. Use
   the :c:data:`CP_ACP` code page to get the MBCS encoder.

   .. versionadded:: 3.3


Methods & Slots
"""""""""""""""


.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return ``NULL`` or ``-1`` if an exception occurs.


.. c:function:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. c:function:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings.  If *sep* is ``NULL``, splitting
   will be done at all whitespace substrings.  Otherwise, splits occur at the given
   separator.  At most *maxsplit* splits will be done.  If *maxsplit* is negative,
   no limit is set.  Separators are not included in the resulting list.


.. c:function:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break.  If *keepend* is ``0``, the line break
   characters are not included in the resulting strings.


.. c:function:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given *separator* and return the resulting
   Unicode string.


.. c:function:: Py_ssize_t PyUnicode_Tailmatch(PyObject *str, PyObject *substr, \
                        Py_ssize_t start, Py_ssize_t end, int direction)

   Return ``1`` if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == ``-1`` means to do a prefix match, *direction* == ``1`` a suffix match),
   ``0`` otherwise. Return ``-1`` if an error occurred.


.. c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, \
                               Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == ``1`` means to do a forward search, *direction* == ``-1`` a
   backward search).  The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.

14587db96d56Sopenharmony_ci
14597db96d56Sopenharmony_ci.. c:function:: Py_ssize_t PyUnicode_FindChar(PyObject *str, Py_UCS4 ch, \
14607db96d56Sopenharmony_ci                               Py_ssize_t start, Py_ssize_t end, int direction)
14617db96d56Sopenharmony_ci
14627db96d56Sopenharmony_ci   Return the first position of the character *ch* in ``str[start:end]`` using
14637db96d56Sopenharmony_ci   the given *direction* (*direction* == ``1`` means to do a forward search,
14647db96d56Sopenharmony_ci   *direction* == ``-1`` a backward search).  The return value is the index of the
14657db96d56Sopenharmony_ci   first match; a value of ``-1`` indicates that no match was found, and ``-2``
14667db96d56Sopenharmony_ci   indicates that an error occurred and an exception has been set.
14677db96d56Sopenharmony_ci
14687db96d56Sopenharmony_ci   .. versionadded:: 3.3
14697db96d56Sopenharmony_ci
14707db96d56Sopenharmony_ci   .. versionchanged:: 3.7
14717db96d56Sopenharmony_ci      *start* and *end* are now adjusted to behave like ``str[start:end]``.


.. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, \
                               Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``.  Return ``-1`` if an error occurred.
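
   To illustrate the non-overlapping rule, here is a small sketch in which
   ``count_substr`` is a hypothetical helper.  Counting ``"ana"`` in
   ``"banana"`` with it yields one occurrence, not two, because the second
   candidate overlaps the first:

   .. code-block:: c

      #include <Python.h>

      /* Count non-overlapping occurrences of needle in str.
         Returns the count, or -1 with an exception set. */
      static Py_ssize_t
      count_substr(PyObject *str, const char *needle)
      {
          PyObject *sub = PyUnicode_FromString(needle);
          if (sub == NULL) {
              return -1;
          }
          Py_ssize_t n = PyUnicode_Count(str, sub, 0, PY_SSIZE_T_MAX);
          Py_DECREF(sub);
          return n;
      }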


.. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, \
                              PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == ``-1`` means replace all
   occurrences.
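
   A minimal sketch of replacing every occurrence, assuming C-string inputs.
   The helper name ``replace_all`` is hypothetical:

   .. code-block:: c

      #include <Python.h>

      /* Replace every occurrence of old_s in str with new_s.
         Returns a new reference, or NULL with an exception set. */
      static PyObject *
      replace_all(PyObject *str, const char *old_s, const char *new_s)
      {
          PyObject *old_obj = PyUnicode_FromString(old_s);
          PyObject *new_obj = PyUnicode_FromString(new_s);
          PyObject *result = NULL;

          if (old_obj != NULL && new_obj != NULL) {
              /* maxcount == -1 replaces all occurrences. */
              result = PyUnicode_Replace(str, old_obj, new_obj, -1);
          }
          Py_XDECREF(old_obj);
          Py_XDECREF(new_obj);
          return result;
      }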


.. c:function:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return ``-1``, ``0``, ``1`` for less than, equal, and greater than,
   respectively.

   This function also returns ``-1`` upon failure.  Because ``-1`` is a valid
   comparison result as well, call :c:func:`PyErr_Occurred` to check for errors.
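
   One way to separate the error case from a "less than" result is sketched
   below.  ``compare_checked`` and its out-parameter convention are
   hypothetical, chosen only for illustration:

   .. code-block:: c

      #include <Python.h>

      /* Store the comparison result in *result.
         Returns 0 on success, -1 with an exception set on failure. */
      static int
      compare_checked(PyObject *left, PyObject *right, int *result)
      {
          int r = PyUnicode_Compare(left, right);
          if (r == -1 && PyErr_Occurred()) {
              return -1;  /* genuine failure, e.g. a non-string argument */
          }
          *result = r;
          return 0;
      }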


.. c:function:: int PyUnicode_CompareWithASCIIString(PyObject *uni, const char *string)

   Compare a Unicode object, *uni*, with *string* and return ``-1``, ``0``, ``1`` for less
   than, equal, and greater than, respectively. It is best to pass only
   ASCII-encoded strings, but the function interprets the input string as
   ISO-8859-1 if it contains non-ASCII characters.

   This function does not raise exceptions.


.. c:function:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
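
   The three kinds of return value can be handled as in the following sketch.
   ``unicode_equal`` is a hypothetical helper, and treating
   :const:`Py_NotImplemented` as "not equal" is a simplifying assumption made
   here, not something the API requires:

   .. code-block:: c

      #include <Python.h>

      /* Return 1 if left == right, 0 if not, -1 with an exception set. */
      static int
      unicode_equal(PyObject *left, PyObject *right)
      {
          PyObject *res = PyUnicode_RichCompare(left, right, Py_EQ);
          if (res == NULL) {
              return -1;
          }
          /* Py_False and Py_NotImplemented both map to 0 in this sketch. */
          int equal = (res == Py_True);
          Py_DECREF(res);
          return equal;
      }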


.. c:function:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``.
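
   For example, the equivalent of ``"%s has %d items" % (name, count)`` could
   be sketched as follows.  The helper name ``describe`` and the format text
   are illustrative assumptions:

   .. code-block:: c

      #include <Python.h>

      /* Build "<name> has <count> items".
         Returns a new reference, or NULL with an exception set. */
      static PyObject *
      describe(const char *name, int count)
      {
          PyObject *fmt = PyUnicode_FromString("%s has %d items");
          PyObject *args = Py_BuildValue("(si)", name, count);  /* 2-tuple */
          PyObject *result = NULL;

          if (fmt != NULL && args != NULL) {
              result = PyUnicode_Format(fmt, args);
          }
          Py_XDECREF(fmt);
          Py_XDECREF(args);
          return result;
      }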


.. c:function:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container*.  Return ``1`` if it is,
   ``0`` if it is not, and ``-1`` if there was an error.

   *element* must be a Unicode string; it is matched as a substring of
   *container*.


.. c:function:: void PyUnicode_InternInPlace(PyObject **string)

   Intern the argument *\*string* in place.  The argument must be the address of a
   pointer variable pointing to a Python Unicode string object.  If there is an
   existing interned string that is the same as *\*string*, *\*string* is set to
   it (decrementing the reference count of the old string object and incrementing
   the reference count of the interned string object); otherwise *\*string* is
   left alone and is interned (incrementing its reference count).
   Despite these reference count adjustments, think of this function as
   reference-count-neutral: you own the object after the call if and only if you
   owned it before the call.
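
   A short sketch of the ownership rule, where ``make_interned`` is a
   hypothetical helper.  The caller owns the string before the call, so it
   still owns whatever *\*string* points to afterwards:

   .. code-block:: c

      #include <Python.h>

      /* Create an interned string from text.
         Returns a new (owned) reference, or NULL with an exception set. */
      static PyObject *
      make_interned(const char *text)
      {
          PyObject *s = PyUnicode_FromString(text);
          if (s == NULL) {
              return NULL;
          }
          /* s may be replaced by an existing interned object here;
             either way the caller owns exactly one reference to s. */
          PyUnicode_InternInPlace(&s);
          return s;
      }

   Two calls with the same text are expected to yield the same object while
   the first result is still alive, which allows pointer comparison.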


.. c:function:: PyObject* PyUnicode_InternFromString(const char *v)

   A combination of :c:func:`PyUnicode_FromString` and
   :c:func:`PyUnicode_InternInPlace`, returning either a new Unicode string
   object that has been interned, or a new ("owned") reference to an earlier
   interned string object with the same value.