.. highlight:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Georg Brandl <georg@python.org>

Unicode Objects
^^^^^^^^^^^^^^^

Since the implementation of :pep:`393` in Python 3.3, Unicode objects internally
use a variety of representations, in order to allow handling the complete range
of Unicode characters while staying memory efficient. There are special cases
for strings where all code points are below 128, 256, or 65536; otherwise, code
points must be below 1114112 (which is the full Unicode range).

:c:expr:`Py_UNICODE*` and UTF-8 representations are created on demand and cached
in the Unicode object. The :c:expr:`Py_UNICODE*` representation is deprecated
and inefficient.

Due to the transition between the old APIs and the new APIs, Unicode objects
can internally be in two states depending on how they were created:

* "canonical" Unicode objects are all objects created by a non-deprecated
  Unicode API. They use the most efficient representation allowed by the
  implementation.

* "legacy" Unicode objects have been created through one of the deprecated
  APIs (typically :c:func:`PyUnicode_FromUnicode`) and only bear the
  :c:expr:`Py_UNICODE*` representation; you will have to call
  :c:func:`PyUnicode_READY` on them before calling any other API.

.. note::
   The "legacy" Unicode object will be removed in Python 3.12 along with the
   deprecated APIs. All Unicode objects will be "canonical" from then on. See
   :pep:`623` for more information.


Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:

.. c:type:: Py_UCS4
            Py_UCS2
            Py_UCS1

   These types are typedefs for unsigned integer types wide enough to contain
   characters of 32 bits, 16 bits and 8 bits, respectively. When dealing with
   single Unicode characters, use :c:type:`Py_UCS4`.

   .. versionadded:: 3.3


.. c:type:: Py_UNICODE

   This is a typedef of :c:expr:`wchar_t`, which is a 16-bit type or 32-bit type
   depending on the platform.

   .. versionchanged:: 3.3
      In previous versions, this was a 16-bit type or a 32-bit type depending on
      whether you selected a "narrow" or "wide" Unicode version of Python at
      build time.


.. c:type:: PyASCIIObject
            PyCompactUnicodeObject
            PyUnicodeObject

   These subtypes of :c:type:`PyObject` represent a Python Unicode object. In
   almost all cases, they shouldn't be used directly, since all API functions
   that deal with Unicode objects take and return :c:type:`PyObject` pointers.

   .. versionadded:: 3.3


.. c:var:: PyTypeObject PyUnicode_Type

   This instance of :c:type:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``str``.


The following APIs are C macros and static inlined functions for fast checks and
access to internal read-only data of Unicode objects:

.. c:function:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype. This function always succeeds.


.. c:function:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype. This function always succeeds.


.. c:function:: int PyUnicode_READY(PyObject *o)

   Ensure the string object *o* is in the "canonical" representation. This is
   required before using any of the access macros described below.

   .. XXX expand on when it is not required

   Returns ``0`` on success and ``-1`` with an exception set on failure, which in
   particular happens if memory allocation fails.

   .. versionadded:: 3.3

   .. deprecated-removed:: 3.10 3.12
      This API will be removed with :c:func:`PyUnicode_FromUnicode`.


.. c:function:: Py_ssize_t PyUnicode_GET_LENGTH(PyObject *o)

   Return the length of the Unicode string, in code points. *o* has to be a
   Unicode object in the "canonical" representation (not checked).

   .. versionadded:: 3.3


.. c:function:: Py_UCS1* PyUnicode_1BYTE_DATA(PyObject *o)
                Py_UCS2* PyUnicode_2BYTE_DATA(PyObject *o)
                Py_UCS4* PyUnicode_4BYTE_DATA(PyObject *o)

   Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4
   integer types for direct character access. No check is made that the
   canonical representation has the correct character size; use
   :c:func:`PyUnicode_KIND` to select the right macro. Make sure
   :c:func:`PyUnicode_READY` has been called before accessing this.

   .. versionadded:: 3.3


.. c:macro:: PyUnicode_WCHAR_KIND
             PyUnicode_1BYTE_KIND
             PyUnicode_2BYTE_KIND
             PyUnicode_4BYTE_KIND

   Return values of the :c:func:`PyUnicode_KIND` macro.

   .. versionadded:: 3.3

   .. deprecated-removed:: 3.10 3.12
      ``PyUnicode_WCHAR_KIND`` is deprecated.


.. c:function:: int PyUnicode_KIND(PyObject *o)

   Return one of the PyUnicode kind constants (see above) that indicate how many
   bytes per character this Unicode object uses to store its data. *o* has to
   be a Unicode object in the "canonical" representation (not checked).

   .. XXX document "0" return value?

   .. versionadded:: 3.3


.. c:function:: void* PyUnicode_DATA(PyObject *o)

   Return a void pointer to the raw Unicode buffer. *o* has to be a Unicode
   object in the "canonical" representation (not checked).

   .. versionadded:: 3.3


.. c:function:: void PyUnicode_WRITE(int kind, void *data, \
                                     Py_ssize_t index, Py_UCS4 value)

   Write into a canonical representation *data* (as obtained with
   :c:func:`PyUnicode_DATA`). This function performs no sanity checks, and is
   intended for usage in loops. The caller should cache the *kind* value and
   *data* pointer as obtained from other calls. *index* is the index in
   the string (starts at 0) and *value* is the new code point value which should
   be written to that location.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_READ(int kind, void *data, \
                                       Py_ssize_t index)

   Read a code point from a canonical representation *data* (as obtained with
   :c:func:`PyUnicode_DATA`). No checks or ready calls are performed.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_READ_CHAR(PyObject *o, Py_ssize_t index)

   Read a character from a Unicode object *o*, which must be in the "canonical"
   representation. This is less efficient than :c:func:`PyUnicode_READ` if you
   do multiple consecutive reads.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_MAX_CHAR_VALUE(PyObject *o)

   Return the maximum code point that is suitable for creating another string
   based on *o*, which must be in the "canonical" representation. This is
   always an approximation but more efficient than iterating over the string.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
   code units (this includes surrogate pairs as 2 units). *o* has to be a
   Unicode object (not checked).

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation in
   bytes. *o* has to be a Unicode object (not checked).

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
                const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to a :c:type:`Py_UNICODE` representation of the object. The
   returned buffer is always terminated with an extra null code point. It
   may also contain embedded null code points, which would cause the string
   to be truncated when used in most C functions. The ``AS_DATA`` form
   casts the pointer to :c:expr:`const char *`. The *o* argument has to be
   a Unicode object (not checked).

   .. versionchanged:: 3.3
      This function is now inefficient -- because in many cases the
      :c:type:`Py_UNICODE` representation does not exist and needs to be created
      -- and can fail (return ``NULL`` with an exception set). Try to port the
      code to use the new :c:func:`PyUnicode_nBYTE_DATA` macros or use
      :c:func:`PyUnicode_WRITE` or :c:func:`PyUnicode_READ`.

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using the
      :c:func:`PyUnicode_nBYTE_DATA` family of macros.


.. c:function:: int PyUnicode_IsIdentifier(PyObject *o)

   Return ``1`` if the string is a valid identifier according to the language
   definition, section :ref:`identifiers`. Return ``0`` otherwise.

   .. versionchanged:: 3.9
      The function does not call :c:func:`Py_FatalError` anymore if the string
      is not ready.


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. c:function:: int Py_UNICODE_ISSPACE(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a whitespace character.


.. c:function:: int Py_UNICODE_ISLOWER(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a lowercase character.


.. c:function:: int Py_UNICODE_ISUPPER(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is an uppercase character.


.. c:function:: int Py_UNICODE_ISTITLE(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a titlecase character.


.. c:function:: int Py_UNICODE_ISLINEBREAK(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a linebreak character.


.. c:function:: int Py_UNICODE_ISDECIMAL(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a decimal character.


.. c:function:: int Py_UNICODE_ISDIGIT(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a digit character.


.. c:function:: int Py_UNICODE_ISNUMERIC(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a numeric character.


.. c:function:: int Py_UNICODE_ISALPHA(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphabetic character.


.. c:function:: int Py_UNICODE_ISALNUM(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphanumeric character.


.. c:function:: int Py_UNICODE_ISPRINTABLE(Py_UCS4 ch)

   Return ``1`` or ``0`` depending on whether *ch* is a printable character.
   Nonprintable characters are those characters defined in the Unicode character
   database as "Other" or "Separator", excepting the ASCII space (0x20) which is
   considered printable. (Note that printable characters in this context are
   those which should not be escaped when :func:`repr` is invoked on a string.
   It has no bearing on the handling of strings written to :data:`sys.stdout` or
   :data:`sys.stderr`.)


These APIs can be used for fast direct character conversions:


.. c:function:: Py_UCS4 Py_UNICODE_TOLOWER(Py_UCS4 ch)

   Return the character *ch* converted to lower case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: Py_UCS4 Py_UNICODE_TOUPPER(Py_UCS4 ch)

   Return the character *ch* converted to upper case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: Py_UCS4 Py_UNICODE_TOTITLE(Py_UCS4 ch)

   Return the character *ch* converted to title case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: int Py_UNICODE_TODECIMAL(Py_UCS4 ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. c:function:: int Py_UNICODE_TODIGIT(Py_UCS4 ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. c:function:: double Py_UNICODE_TONUMERIC(Py_UCS4 ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.


These APIs can be used to work with surrogates:

.. c:macro:: Py_UNICODE_IS_SURROGATE(ch)

   Check if *ch* is a surrogate (``0xD800 <= ch <= 0xDFFF``).

.. c:macro:: Py_UNICODE_IS_HIGH_SURROGATE(ch)

   Check if *ch* is a high surrogate (``0xD800 <= ch <= 0xDBFF``).

.. c:macro:: Py_UNICODE_IS_LOW_SURROGATE(ch)

   Check if *ch* is a low surrogate (``0xDC00 <= ch <= 0xDFFF``).

.. c:macro:: Py_UNICODE_JOIN_SURROGATES(high, low)

   Join two surrogate characters and return a single Py_UCS4 value.
   *high* and *low* are respectively the leading and trailing surrogates in a
   surrogate pair.
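
The surrogate ranges listed above, and the standard UTF-16 arithmetic for
joining a pair, can be sketched in standalone C. This is only an illustration:
the function names ``is_surrogate``, ``is_high_surrogate``, ``is_low_surrogate``
and ``join_surrogates`` are hypothetical stand-ins, not the CPython macros, and
the sketch assumes the macros follow the usual UTF-16 decoding rule.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the documented range checks. */
int is_surrogate(uint32_t ch)
{
    return 0xD800 <= ch && ch <= 0xDFFF;
}

int is_high_surrogate(uint32_t ch)
{
    return 0xD800 <= ch && ch <= 0xDBFF;
}

int is_low_surrogate(uint32_t ch)
{
    return 0xDC00 <= ch && ch <= 0xDFFF;
}

/* Standard UTF-16 rule: each surrogate carries 10 payload bits, and the
 * combined 20-bit value is offset by 0x10000.  The caller is expected to
 * have checked both halves first; no validation happens here. */
uint32_t join_surrogates(uint32_t high, uint32_t low)
{
    return 0x10000 + (((high & 0x03FF) << 10) | (low & 0x03FF));
}
```

A caller would typically test ``is_high_surrogate(high) && is_low_surrogate(low)``
before joining, since the join itself performs no checks.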


Creating and accessing Unicode strings
""""""""""""""""""""""""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:

.. c:function:: PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)

   Create a new Unicode object. *maxchar* should be the true maximum code point
   to be placed in the string. As an approximation, it can be rounded up to the
   nearest value in the sequence 127, 255, 65535, 1114111.

   This is the recommended way to allocate a new Unicode object. Objects
   created using this function are not resizable.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_FromKindAndData(int kind, const void *buffer, \
                                                    Py_ssize_t size)

   Create a new Unicode object with the given *kind* (possible values are
   :c:macro:`PyUnicode_1BYTE_KIND` etc., as returned by
   :c:func:`PyUnicode_KIND`). The *buffer* must point to an array of *size*
   units of 1, 2 or 4 bytes per character, as given by the kind.

   If necessary, the input *buffer* is copied and transformed into the
   canonical representation. For example, if the *buffer* is a UCS4 string
   (:c:macro:`PyUnicode_4BYTE_KIND`) and it consists only of codepoints in
   the UCS1 range, it will be transformed into UCS1
   (:c:macro:`PyUnicode_1BYTE_KIND`).

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*. The bytes will be
   interpreted as being UTF-8 encoded. The buffer is copied into the new
   object. If the buffer is not ``NULL``, the return value might be a shared
   object, i.e. modification of the data is not allowed.

   If *u* is ``NULL``, this function behaves like :c:func:`PyUnicode_FromUnicode`
   with the buffer set to ``NULL``. This usage is deprecated in favor of
   :c:func:`PyUnicode_New`, and will be removed in Python 3.12.


.. c:function:: PyObject *PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.


.. c:function:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :c:func:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python Unicode string and return
   a string with the values formatted into it. The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   ASCII-encoded string. The following format characters are allowed:

   .. % This should be exactly the same as the table in PyErr_Format.
   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.
   .. % Similar comments apply to the %ll width modifier and

   .. tabularcolumns:: |l|l|L|

   +-------------------+---------------------+----------------------------------+
   | Format Characters | Type                | Comment                          |
   +===================+=====================+==================================+
   | :attr:`%%`        | *n/a*               | The literal % character.         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%c`        | int                 | A single character,              |
   |                   |                     | represented as a C int.          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%d`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%d")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%u`        | unsigned int        | Equivalent to                    |
   |                   |                     | ``printf("%u")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%ld`       | long                | Equivalent to                    |
   |                   |                     | ``printf("%ld")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%li`       | long                | Equivalent to                    |
   |                   |                     | ``printf("%li")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lu`       | unsigned long       | Equivalent to                    |
   |                   |                     | ``printf("%lu")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lld`      | long long           | Equivalent to                    |
   |                   |                     | ``printf("%lld")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lli`      | long long           | Equivalent to                    |
   |                   |                     | ``printf("%lli")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%llu`      | unsigned long long  | Equivalent to                    |
   |                   |                     | ``printf("%llu")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zd`       | :c:type:`\          | Equivalent to                    |
   |                   | Py_ssize_t`         | ``printf("%zd")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zi`       | :c:type:`\          | Equivalent to                    |
   |                   | Py_ssize_t`         | ``printf("%zi")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zu`       | size_t              | Equivalent to                    |
   |                   |                     | ``printf("%zu")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%i`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%i")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%x`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%x")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%s`        | const char\*        | A null-terminated C character    |
   |                   |                     | array.                           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%p`        | const void\*        | The hex representation of a C    |
   |                   |                     | pointer. Mostly equivalent to    |
   |                   |                     | ``printf("%p")`` except that     |
   |                   |                     | it is guaranteed to start with   |
   |                   |                     | the literal ``0x`` regardless    |
   |                   |                     | of what the platform's           |
   |                   |                     | ``printf`` yields.               |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%A`        | PyObject\*          | The result of calling            |
   |                   |                     | :func:`ascii`.                   |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%U`        | PyObject\*          | A Unicode object.                |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%V`        | PyObject\*,         | A Unicode object (which may be   |
   |                   | const char\*        | ``NULL``) and a null-terminated  |
   |                   |                     | C character array as a second    |
   |                   |                     | parameter (which will be used,   |
   |                   |                     | if the first parameter is        |
   |                   |                     | ``NULL``).                       |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling            |
   |                   |                     | :c:func:`PyObject_Str`.          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling            |
   |                   |                     | :c:func:`PyObject_Repr`.         |
   +-------------------+---------------------+----------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.

   .. note::
      The width formatter unit is number of characters rather than bytes.
      The precision formatter unit is number of bytes for ``"%s"`` and
      ``"%V"`` (if the ``PyObject*`` argument is ``NULL``), and a number of
      characters for ``"%A"``, ``"%U"``, ``"%S"``, ``"%R"`` and ``"%V"``
      (if the ``PyObject*`` argument is not ``NULL``).

   .. [1] For integer specifiers (d, u, ld, li, lu, lld, lli, llu, zd, zi,
      zu, i, x): the 0-conversion flag has effect even when a precision is given.

   .. versionchanged:: 3.2
      Support for ``"%lld"`` and ``"%llu"`` added.

   .. versionchanged:: 3.3
      Support for ``"%li"``, ``"%lli"`` and ``"%zi"`` added.

   .. versionchanged:: 3.4
      Support width and precision formatter for ``"%s"``, ``"%A"``, ``"%U"``,
      ``"%V"``, ``"%S"``, ``"%R"`` added.


.. c:function:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :c:func:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.


.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Copy an instance of a Unicode subtype to a new true Unicode object if
   necessary. If *obj* is already a true Unicode object (not a subtype),
   return the reference with incremented refcount.

   Objects other than Unicode or its subtypes will cause a :exc:`TypeError`.


.. c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, \
                               const char *encoding, const char *errors)

   Decode an encoded object *obj* to a Unicode object.

   :class:`bytes`, :class:`bytearray` and other
   :term:`bytes-like objects <bytes-like object>`
   are decoded according to the given *encoding* and using the error handling
   defined by *errors*. Both can be ``NULL`` to have the interface use the default
   values (see :ref:`builtincodecs` for details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns ``NULL`` if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. c:function:: Py_ssize_t PyUnicode_GetLength(PyObject *unicode)

   Return the length of the Unicode object, in code points.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_CopyCharacters(PyObject *to, \
                                                    Py_ssize_t to_start, \
                                                    PyObject *from, \
                                                    Py_ssize_t from_start, \
                                                    Py_ssize_t how_many)

   Copy characters from one Unicode object into another. This function performs
   character conversion when necessary and falls back to :c:func:`memcpy` if
   possible. Returns ``-1`` and sets an exception on error, otherwise returns
   the number of copied characters.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_Fill(PyObject *unicode, Py_ssize_t start, \
                        Py_ssize_t length, Py_UCS4 fill_char)

   Fill a string with a character: write *fill_char* into
   ``unicode[start:start+length]``.

   Fail if *fill_char* is bigger than the string maximum character, or if the
   string has more than 1 reference.

   Return the number of written characters, or return ``-1`` and raise an
   exception on error.

   .. versionadded:: 3.3


.. c:function:: int PyUnicode_WriteChar(PyObject *unicode, Py_ssize_t index, \
                                        Py_UCS4 character)

   Write a character to a string. The string must have been created through
   :c:func:`PyUnicode_New`. Since Unicode strings are supposed to be immutable,
   the string must not be shared, or have been hashed yet.

   This function checks that *unicode* is a Unicode object, that the index is
   not out of bounds, and that the object can be modified safely (i.e. that
   its reference count is one).

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_ReadChar(PyObject *unicode, Py_ssize_t index)

   Read a character from a string. This function checks that *unicode* is a
   Unicode object and the index is not out of bounds, in contrast to
   :c:func:`PyUnicode_READ_CHAR`, which performs no error checking.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_Substring(PyObject *str, Py_ssize_t start, \
                                              Py_ssize_t end)

   Return a substring of *str*, from character index *start* (included) to
   character index *end* (excluded). Negative indices are not supported.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4* PyUnicode_AsUCS4(PyObject *u, Py_UCS4 *buffer, \
                                          Py_ssize_t buflen, int copy_null)

   Copy the string *u* into a UCS4 buffer, including a null character, if
   *copy_null* is set. Returns ``NULL`` and sets an exception on error (in
   particular, a :exc:`SystemError` if *buflen* is smaller than the length of
   *u*). *buffer* is returned on success.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4* PyUnicode_AsUCS4Copy(PyObject *u)

   Copy the string *u* into a new UCS4 buffer that is allocated using
   :c:func:`PyMem_Malloc`. If this fails, ``NULL`` is returned with a
   :exc:`MemoryError` set. The returned buffer always has an extra
   null code point appended.

   .. versionadded:: 3.3


Deprecated Py_UNICODE APIs
""""""""""""""""""""""""""

.. deprecated-removed:: 3.3 3.12

These API functions are deprecated with the implementation of :pep:`393`.
Extension modules can continue using them until their removal in Python 3.12,
but need to be aware that their use can now cause performance and memory hits.


.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
   may be ``NULL`` which causes the contents to be undefined. It is the user's
   responsibility to fill in the needed data. The buffer is copied into the new
   object.

   If the buffer is not ``NULL``, the return value might be a shared object.
   Therefore, modification of the resulting Unicode object is only allowed when
   *u* is ``NULL``.

   If the buffer is ``NULL``, :c:func:`PyUnicode_READY` must be called once the
   string content has been filled before using any of the access macros such as
   :c:func:`PyUnicode_KIND`.

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_FromKindAndData`, :c:func:`PyUnicode_FromWideChar`, or
      :c:func:`PyUnicode_New`.


.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal
   :c:type:`Py_UNICODE` buffer, or ``NULL`` on error. This will create the
   :c:expr:`Py_UNICODE*` representation of the object if it is not yet
   available. The buffer is always terminated with an extra null code point.
   Note that the resulting :c:type:`Py_UNICODE` string may also contain
   embedded null code points, which would cause the string to be truncated when
   used in most C functions.

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_AsUCS4`, :c:func:`PyUnicode_AsWideChar`,
      :c:func:`PyUnicode_ReadChar` or similar new APIs.


.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size)

   Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:type:`Py_UNICODE`
   array length (excluding the extra null terminator) in *size*.
   Note that the resulting :c:expr:`Py_UNICODE*` string
   may contain embedded null code points, which would cause the string to be
   truncated when used in most C functions.

   .. versionadded:: 3.3

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_AsUCS4`, :c:func:`PyUnicode_AsWideChar`,
      :c:func:`PyUnicode_ReadChar` or similar new APIs.


.. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
   code units (this includes surrogate pairs as 2 units).

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.
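
The migration advice in this section can be illustrated with a small sketch.
It walks a string with :c:func:`PyUnicode_GetLength` and
:c:func:`PyUnicode_ReadChar` instead of reading a :c:type:`Py_UNICODE` buffer.
The helper name ``count_non_ascii`` is hypothetical and only serves the
example; it is not part of the C API.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper: count the code points >= 128 in *unicode*.
      Returns -1 with an exception set on error. */
   static Py_ssize_t
   count_non_ascii(PyObject *unicode)
   {
       /* Length in code points; -1 if *unicode* is not a Unicode object. */
       Py_ssize_t length = PyUnicode_GetLength(unicode);
       if (length < 0) {
           return -1;
       }

       Py_ssize_t count = 0;
       for (Py_ssize_t i = 0; i < length; i++) {
           /* Checked read: returns (Py_UCS4)-1 and sets an exception on error. */
           Py_UCS4 ch = PyUnicode_ReadChar(unicode, i);
           if (ch == (Py_UCS4)-1 && PyErr_Occurred()) {
               return -1;
           }
           if (ch >= 128) {
               count++;
           }
       }
       return count;
   }

Because the loop works on code points, it does not depend on whether
:c:expr:`wchar_t` is 16 or 32 bits wide on the platform.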


Locale Encoding
"""""""""""""""

The current locale encoding can be used to decode text from the operating
system.

.. c:function:: PyObject* PyUnicode_DecodeLocaleAndSize(const char *str, \
                                                        Py_ssize_t len, \
                                                        const char *errors)

   Decode a string from UTF-8 on Android and VxWorks, or from the current
   locale encoding on other platforms. The supported
   error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The decoder uses ``"strict"`` error handler if
   *errors* is ``NULL``. *str* must end with a null character but
   cannot contain embedded null characters.

   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` to decode a string from
   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
   Python startup).

   This function ignores the :ref:`Python UTF-8 Mode <utf8-mode>`.

   .. seealso::

      The :c:func:`Py_DecodeLocale` function.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously, :c:func:`Py_DecodeLocale`
      was used for the ``surrogateescape``, and the current locale encoding was
      used for ``strict``.


.. c:function:: PyObject* PyUnicode_DecodeLocale(const char *str, const char *errors)

   Similar to :c:func:`PyUnicode_DecodeLocaleAndSize`, but compute the string
   length using :c:func:`strlen`.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_EncodeLocale(PyObject *unicode, const char *errors)

   Encode a Unicode object to UTF-8 on Android and VxWorks, or to the current
   locale encoding on other platforms. The
   supported error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The encoder uses ``"strict"`` error handler if
   *errors* is ``NULL``. Return a :class:`bytes` object. *unicode* cannot
   contain embedded null characters.

   Use :c:func:`PyUnicode_EncodeFSDefault` to encode a string to
   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
   Python startup).

   This function ignores the :ref:`Python UTF-8 Mode <utf8-mode>`.

   .. seealso::

      The :c:func:`Py_EncodeLocale` function.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously,
      :c:func:`Py_EncodeLocale`
      was used for the ``surrogateescape``, and the current locale encoding was
      used for ``strict``.


File System Encoding
""""""""""""""""""""

To encode and decode file names and other environment strings,
:c:data:`Py_FileSystemDefaultEncoding` should be used as the encoding, and
:c:data:`Py_FileSystemDefaultEncodeErrors` should be used as the error handler
(:pep:`383` and :pep:`529`). To encode file names to :class:`bytes` during
argument parsing, the ``"O&"`` converter should be used, passing
:c:func:`PyUnicode_FSConverter` as the conversion function:

.. c:function:: int PyUnicode_FSConverter(PyObject* obj, void* result)

   ParseTuple converter: encode :class:`str` objects -- obtained directly or
   through the :class:`os.PathLike` interface -- to :class:`bytes` using
   :c:func:`PyUnicode_EncodeFSDefault`; :class:`bytes` objects are output as-is.
   *result* must be a :c:expr:`PyBytesObject*` which must be released when it is
   no longer used.

   .. versionadded:: 3.1

   .. versionchanged:: 3.6
      Accepts a :term:`path-like object`.
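
A minimal sketch of the ``"O&"`` pattern with
:c:func:`PyUnicode_FSConverter` might look like the following. The function
name ``encoded_path_size`` is hypothetical; a real extension would typically
pass the encoded bytes on to an operating system call rather than just
measuring them.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical extension function: return the number of bytes in the
      file-system-encoded form of its single path argument. */
   static PyObject *
   encoded_path_size(PyObject *self, PyObject *args)
   {
       PyObject *path_bytes = NULL;

       /* "O&" passes the argument to PyUnicode_FSConverter(), which stores
          a new reference to a bytes object in path_bytes on success. */
       if (!PyArg_ParseTuple(args, "O&", PyUnicode_FSConverter, &path_bytes)) {
           return NULL;
       }

       Py_ssize_t size = PyBytes_GET_SIZE(path_bytes);
       Py_DECREF(path_bytes);  /* release the converter's result */
       return PyLong_FromSsize_t(size);
   }

The explicit :c:func:`Py_DECREF` reflects the requirement that the converted
object be released once it is no longer used.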

To decode file names to :class:`str` during argument parsing, the ``"O&"``
converter should be used, passing :c:func:`PyUnicode_FSDecoder` as the
conversion function:

.. c:function:: int PyUnicode_FSDecoder(PyObject* obj, void* result)

   ParseTuple converter: decode :class:`bytes` objects -- obtained either
   directly or indirectly through the :class:`os.PathLike` interface -- to
   :class:`str` using :c:func:`PyUnicode_DecodeFSDefaultAndSize`; :class:`str`
   objects are output as-is. *result* must be a :c:expr:`PyUnicodeObject*` which
   must be released when it is no longer used.

   .. versionadded:: 3.2

   .. versionchanged:: 3.6
      Accepts a :term:`path-like object`.


.. c:function:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)

   Decode a string from the :term:`filesystem encoding and error handler`.

   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
   locale encoding.

   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
   locale encoding and cannot be modified later. If you need to decode a string
   from the current locale encoding, use
   :c:func:`PyUnicode_DecodeLocaleAndSize`.

   .. seealso::

      The :c:func:`Py_DecodeLocale` function.

   .. versionchanged:: 3.6
      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.


.. c:function:: PyObject* PyUnicode_DecodeFSDefault(const char *s)

   Decode a null-terminated string from the :term:`filesystem encoding and
   error handler`.

   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
   locale encoding.

   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length.

   .. versionchanged:: 3.6
      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.


.. c:function:: PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode)

   Encode a Unicode object to :c:data:`Py_FileSystemDefaultEncoding` with the
   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler, and return
   :class:`bytes`. Note that the resulting :class:`bytes` object may contain
   null bytes.

   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
   locale encoding.

   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
   locale encoding and cannot be modified later. If you need to encode a string
   to the current locale encoding, use :c:func:`PyUnicode_EncodeLocale`.

   .. seealso::

      The :c:func:`Py_EncodeLocale` function.

   .. versionadded:: 3.2

   .. versionchanged:: 3.6
      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.

wchar_t Support
"""""""""""""""

:c:expr:`wchar_t` support for platforms which support it:

.. c:function:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :c:expr:`wchar_t` buffer *w* of the given *size*.
   Passing ``-1`` as the *size* indicates that the function must itself compute the length,
   using wcslen.
   Return ``NULL`` on failure.


.. c:function:: Py_ssize_t PyUnicode_AsWideChar(PyObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :c:expr:`wchar_t` buffer *w*. At most
   *size* :c:expr:`wchar_t` characters are copied (excluding a possibly trailing
   null termination character). Return the number of :c:expr:`wchar_t` characters
   copied or ``-1`` in case of an error. Note that the resulting :c:expr:`wchar_t*`
   string may or may not be null-terminated. It is the responsibility of the caller
   to make sure that the :c:expr:`wchar_t*` string is null-terminated in case this is
   required by the application. Also, note that the :c:expr:`wchar_t*` string
   might contain null characters, which would cause the string to be truncated
   when used with most C functions.


.. c:function:: wchar_t* PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)

   Convert the Unicode object to a wide character string. The output string
   always ends with a null character. If *size* is not ``NULL``, write the number
   of wide characters (excluding the trailing null termination character) into
   *\*size*. Note that the resulting :c:expr:`wchar_t` string might contain
   null characters, which would cause the string to be truncated when used with
   most C functions. If *size* is ``NULL`` and the :c:expr:`wchar_t*` string
   contains null characters a :exc:`ValueError` is raised.

   Returns a buffer allocated by :c:func:`PyMem_New` (use
   :c:func:`PyMem_Free` to free it) on success. On error, returns ``NULL``
   and *\*size* is undefined. Raises a :exc:`MemoryError` if memory allocation
   fails.

   .. versionadded:: 3.2

   .. versionchanged:: 3.7
      Raises a :exc:`ValueError` if *size* is ``NULL`` and the :c:expr:`wchar_t*`
      string contains null characters.
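
The interaction between the reported *size* and embedded null characters can
be sketched as follows. The helper ``has_embedded_null`` is a hypothetical
name used only for illustration. It compares the length reported by
:c:func:`PyUnicode_AsWideCharString` with what ``wcslen`` sees, and frees the
buffer with :c:func:`PyMem_Free` as described above.

.. code-block:: c

   #include <Python.h>
   #include <wchar.h>

   /* Hypothetical helper: return 1 if the wide-character form of *unicode*
      contains an embedded null, 0 if not, and -1 with an exception set on
      error. */
   static int
   has_embedded_null(PyObject *unicode)
   {
       Py_ssize_t size;

       /* Passing a non-NULL size pointer allows embedded nulls; the
          returned buffer is always null-terminated. */
       wchar_t *wide = PyUnicode_AsWideCharString(unicode, &size);
       if (wide == NULL) {
           return -1;
       }

       /* wcslen() stops at the first null, so a shorter result means the
          string would be truncated by most C functions. */
       int result = (wcslen(wide) != (size_t)size);

       PyMem_Free(wide);  /* the caller owns the buffer */
       return result;
   }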


.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments encoding and errors, and they
have the same semantics as the ones of the built-in :func:`str` string object
constructor.

Setting encoding to ``NULL`` causes the default encoding to be used
which is UTF-8. The file system calls should use
:c:func:`PyUnicode_FSConverter` for encoding file names. This uses the
variable :c:data:`Py_FileSystemDefaultEncoding` internally. This
variable should be treated as read-only: on some systems, it will be a
pointer to a static string, on others, it will change at run-time
(such as when the application invokes setlocale).

Error handling is set by errors which may also be set to ``NULL`` meaning to use
the default handling defined for the codec. Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. Only deviations from the following
generic ones are documented for simplicity.


Generic Codecs
""""""""""""""

These are the generic codec APIs:


.. c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, \
                              const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`str` built-in function.  The codec to be used is looked up
   using the Python codec registry.  Return ``NULL`` if an exception was raised by
   the codec.


.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, \
                              const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python bytes object.
   *encoding* and *errors* have the same meaning as the parameters of the same
   name in the Unicode :meth:`~str.encode` method.  The codec to be used is looked up
   using the Python codec registry. Return ``NULL`` if an exception was raised by
   the codec.


UTF-8 Codecs
""""""""""""

These are the UTF-8 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF8`. If
   *consumed* is not ``NULL``, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: const char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)

   Return a pointer to the UTF-8 encoding of the Unicode object, and
   store the size of the encoded representation (in bytes) in *size*.  The
   *size* argument can be ``NULL``; in this case no size will be stored.  The
   returned buffer always has an extra null byte appended (not included in
   *size*), regardless of whether there are any other null code points.

   In the case of an error, ``NULL`` is returned with an exception set and no
   *size* is stored.

   This caches the UTF-8 representation of the string in the Unicode object, and
   subsequent calls will return a pointer to the same buffer.  The caller is not
   responsible for deallocating the buffer. The buffer is deallocated and
   pointers to it become invalid when the Unicode object is garbage collected.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The return type is now ``const char *`` instead of ``char *``.

   .. versionchanged:: 3.10
      This function is a part of the :ref:`limited API <stable>`.


.. c:function:: const char* PyUnicode_AsUTF8(PyObject *unicode)

   As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The return type is now ``const char *`` instead of ``char *``.
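
The UTF-8 functions above might be combined as in the following sketch. Both
helper names are hypothetical. ``pending_utf8_bytes`` uses the *consumed*
argument of :c:func:`PyUnicode_DecodeUTF8Stateful` to report how many trailing
bytes of a chunk were left undecoded, and ``utf8_byte_length`` reads the cached
UTF-8 size through :c:func:`PyUnicode_AsUTF8AndSize`.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper: return how many trailing bytes of buf form an
      incomplete UTF-8 sequence and were therefore not decoded, or -1 with
      an exception set if the data is invalid. */
   static Py_ssize_t
   pending_utf8_bytes(const char *buf, Py_ssize_t len)
   {
       Py_ssize_t consumed = 0;
       PyObject *text = PyUnicode_DecodeUTF8Stateful(buf, len, NULL, &consumed);
       if (text == NULL) {
           return -1;
       }
       Py_DECREF(text);
       return len - consumed;   /* bytes to carry over to the next chunk */
   }

   /* Hypothetical helper: size in bytes of the UTF-8 form of a str object,
      or -1 with an exception set.  The buffer itself is owned by the
      Unicode object and must not be freed by the caller. */
   static Py_ssize_t
   utf8_byte_length(PyObject *unicode)
   {
       Py_ssize_t size = 0;
       const char *utf8 = PyUnicode_AsUTF8AndSize(unicode, &size);
       if (utf8 == NULL) {
           return -1;
       }
       return size;
   }

A streaming reader would typically keep the pending bytes and prepend them to
the next chunk before decoding again.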


UTF-32 Codecs
"""""""""""""

These are the UTF-32 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object.  *errors* (if non-``NULL``) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first four bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output.

   After completion, ``*byteorder`` is set to the current byte order at the end
   of input data.

   If *byteorder* is ``NULL``, the codec starts in native order mode.

   Return ``NULL`` if an exception was raised by the codec.
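
The *byteorder* handling described above can be sketched as follows. The
helper name ``decode_utf32_autodetect`` is hypothetical; it starts the decoder
with ``*byteorder`` set to ``0`` so that a leading BOM, if present, selects the
byte order and is dropped from the result.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper: decode UTF-32 data and let a leading BOM choose
      the byte order.  On success, *detected receives the byte order
      reported by the decoder (-1 for little endian, 1 for big endian when
      a BOM selected it).  Returns NULL with an exception set on error. */
   static PyObject *
   decode_utf32_autodetect(const char *buf, Py_ssize_t len, int *detected)
   {
       int byteorder = 0;   /* 0: start in native order, honour a BOM */
       PyObject *text = PyUnicode_DecodeUTF32(buf, len, NULL, &byteorder);
       if (text != NULL) {
           *detected = byteorder;
       }
       return text;
   }

Passing ``-1`` or ``1`` instead would force that byte order and, as noted
above, copy any BOM into the output.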


.. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF32`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python byte string using the UTF-32 encoding in native byte
   order. The string always starts with a BOM.  Error handling is "strict".
   Return ``NULL`` if an exception was raised by the codec.


UTF-16 Codecs
"""""""""""""

These are the UTF-16 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object.  *errors* (if non-``NULL``) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first two bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output (where it will result in
   either a ``\ufeff`` or a ``\ufffe`` character).

   After completion, ``*byteorder`` is set to the current byte order at the end
   of input data.

   If *byteorder* is ``NULL``, the codec starts in native order mode.

   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF16`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python byte string using the UTF-16 encoding in native byte
   order. The string always starts with a BOM.  Error handling is "strict".
   Return ``NULL`` if an exception was raised by the codec.


UTF-7 Codecs
""""""""""""

These are the UTF-7 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
   *s*.  Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF7`.  If
   *consumed* is not ``NULL``, trailing incomplete UTF-7 base-64 sections will not
   be treated as an error.  Those bytes will not be decoded and the number of
   bytes that have been decoded will be stored in *consumed*.


Unicode-Escape Codecs
"""""""""""""""""""""

These are the "Unicode Escape" codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, \
                              Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*.  Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as a
   bytes object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


Raw-Unicode-Escape Codecs
"""""""""""""""""""""""""

These are the "Raw Unicode Escape" codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, \
                              Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*.  Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as
   a bytes object.  Error handling is "strict".  Return ``NULL`` if an exception
   was raised by the codec.


Latin-1 Codecs
""""""""""""""

These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.


.. c:function:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*.  Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


ASCII Codecs
""""""""""""

These are the ASCII codec APIs.  Only 7-bit ASCII data is accepted. All other
codes generate errors.


.. c:function:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*.  Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


Character Map Codecs
""""""""""""""""""""

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.  The mapping objects provided must support the
:meth:`__getitem__` mapping interface; dictionaries and sequences work well.
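
As a rough illustration of a mapping-based codec, the sketch below decodes
bytes with a small dictionary passed to :c:func:`PyUnicode_DecodeCharmap`. The
helper name ``decode_with_toy_charmap`` and the two-entry mapping are
assumptions made for this example only; a real codec would normally map all
256 byte values.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper: decode buf with a toy mapping in which byte 0x80
      maps to the ordinal U+20AC and byte 0x41 maps to the string "a".
      Bytes missing from the mapping raise a LookupError during lookup and
      are therefore errors under "strict" handling. */
   static PyObject *
   decode_with_toy_charmap(const char *buf, Py_ssize_t len)
   {
       /* Builds the dict {128: 8364, 65: 'a'}. */
       PyObject *mapping = Py_BuildValue("{i:i,i:s}", 0x80, 0x20AC, 0x41, "a");
       if (mapping == NULL) {
           return NULL;
       }
       PyObject *text = PyUnicode_DecodeCharmap(buf, len, mapping, "strict");
       Py_DECREF(mapping);
       return text;
   }

The dictionary shows both value forms a decoding map may use: an integer
interpreted as a Unicode ordinal, and a Unicode string.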

These are the mapping codec APIs:

.. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *data, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *data*
   using the given *mapping* object.  Return ``NULL`` if an exception was raised
   by the codec.

   If *mapping* is ``NULL``, Latin-1 decoding will be applied.  Else
   *mapping* must map bytes ordinals (integers in the range from 0 to 255)
   to Unicode strings, integers (which are then interpreted as Unicode
   ordinals) or ``None``.  Unmapped data bytes (ones which cause a
   :exc:`LookupError`, as well as ones which get mapped to ``None``,
   ``0xFFFE`` or ``'\ufffe'``) are treated as undefined mappings and cause
   an error.


.. c:function:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the
   result as a bytes object.  Error handling is "strict".  Return ``NULL`` if an
   exception was raised by the codec.

   The *mapping* object must map Unicode ordinal integers to bytes objects,
   integers in the range from 0 to 255 or ``None``.  Unmapped character
   ordinals (ones which cause a :exc:`LookupError`) as well as ones mapped to
   ``None`` are treated as "undefined mapping" and cause an error.


The following codec API is special in that it maps Unicode to Unicode.

.. c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object. Return ``NULL`` if an exception was raised by the
   codec.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well.  Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be ``NULL``, which indicates
   that the default error handling should be used.


MBCS codecs for Windows
"""""""""""""""""""""""

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions.  Note that MBCS (or
DBCS) is a class of encodings, not just one.  The target encoding is defined by
the user settings on the machine running the codec.

.. c:function:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeMBCS`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte and the number of bytes that have been decoded will be
   stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeCodePage(int code_page, PyObject *unicode, const char *errors)

   Encode the Unicode object using the specified code page and return a Python
   bytes object.  Return ``NULL`` if an exception was raised by the codec. Use
   the :c:data:`CP_ACP` code page to get the MBCS encoder.

   .. versionadded:: 3.3


Methods & Slots
"""""""""""""""


.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return ``NULL`` or ``-1`` if an exception occurs.


.. c:function:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. c:function:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string giving a list of Unicode strings.  If *sep* is ``NULL``, splitting
   will be done at all whitespace substrings.  Otherwise, splits occur at the given
   separator.  At most *maxsplit* splits will be done.  If negative, no limit is
   set.  Separators are not included in the resulting list.


.. c:function:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break.  If *keepend* is ``0``, the line break
   characters are not included in the resulting strings.


.. c:function:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given *separator* and return the resulting
   Unicode string.


.. c:function:: Py_ssize_t PyUnicode_Tailmatch(PyObject *str, PyObject *substr, \
                        Py_ssize_t start, Py_ssize_t end, int direction)

   Return ``1`` if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == ``-1`` means to do a prefix match, *direction* == ``1`` a suffix match),
   ``0`` otherwise. Return ``-1`` if an error occurred.


.. c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, \
                               Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == ``1`` means to do a forward search, *direction* == ``-1`` a
   backward search).  The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.


.. c:function:: Py_ssize_t PyUnicode_FindChar(PyObject *str, Py_UCS4 ch, \
                               Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of the character *ch* in ``str[start:end]`` using
   the given *direction* (*direction* == ``1`` means to do a forward search,
   *direction* == ``-1`` a backward search).  The return value is the index of the
   first match; a value of ``-1`` indicates that no match was found, and ``-2``
   indicates that an error occurred and an exception has been set.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      *start* and *end* are now adjusted to behave like ``str[start:end]``.


.. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, \
                               Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``.  Return ``-1`` if an error occurred.


.. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, \
                              PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == ``-1`` means replace all
   occurrences.


.. c:function:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return ``-1``, ``0``, ``1`` for less than, equal, and greater than,
   respectively.

   This function returns ``-1`` upon failure, so one should call
   :c:func:`PyErr_Occurred` to check for errors.


.. c:function:: int PyUnicode_CompareWithASCIIString(PyObject *uni, const char *string)

   Compare a Unicode object, *uni*, with *string* and return ``-1``, ``0``, ``1`` for less
   than, equal, and greater than, respectively. It is best to pass only
   ASCII-encoded strings, but the function interprets the input string as
   ISO-8859-1 if it contains non-ASCII characters.

   This function does not raise exceptions.


.. c:function:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. c:function:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``.


.. c:function:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one element Unicode string. ``-1`` is returned
   if there was an error.


.. c:function:: void PyUnicode_InternInPlace(PyObject **string)

   Intern the argument ``*string`` in place.  The argument must be the address of a
   pointer variable pointing to a Python Unicode string object.  If there is an
   existing interned string that is the same as ``*string``, it sets ``*string`` to
   it (decrementing the reference count of the old string object and incrementing
   the reference count of the interned string object), otherwise it leaves
   ``*string`` alone and interns it (incrementing its reference count).
   (Clarification: even though there is a lot of talk about reference counts, think
   of this function as reference-count-neutral; you own the object after the call
   if and only if you owned it before the call.)


.. c:function:: PyObject* PyUnicode_InternFromString(const char *v)

   A combination of :c:func:`PyUnicode_FromString` and
   :c:func:`PyUnicode_InternInPlace`, returning either a new Unicode string
   object that has been interned, or a new ("owned") reference to an earlier
   interned string object with the same value.
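
The effect of interning can be observed with a small sketch such as the one
below. The helper name ``interned_strings_are_identical`` is hypothetical; it
relies only on :c:func:`PyUnicode_InternFromString` returning an owned
reference on each call.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper: return 1 if interning the same C string twice
      yields the very same object, 0 if not, and -1 with an exception set
      on error. */
   static int
   interned_strings_are_identical(const char *s)
   {
       PyObject *a = PyUnicode_InternFromString(s);
       if (a == NULL) {
           return -1;
       }
       PyObject *b = PyUnicode_InternFromString(s);
       if (b == NULL) {
           Py_DECREF(a);
           return -1;
       }
       int same = (a == b);   /* pointer identity, not a value comparison */
       Py_DECREF(a);          /* both references are owned by this function */
       Py_DECREF(b);
       return same;
   }

Because equal interned strings share one object, code that looks up the same
names repeatedly can compare pointers before falling back to a full string
comparison.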