:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.

.. moduleauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>

**Source code:** :source:`Lib/codecs.py`

.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

--------------

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry, which
manages the codec and error handling lookup process. Most standard codecs
are :term:`text encodings <text encoding>`, which encode text to bytes (and
decode bytes to text), but there are also codecs provided that encode text to
text, and bytes to bytes. Custom codecs may encode and decode between arbitrary
types, but some module features are restricted to be used specifically with
:term:`text encodings <text encoding>` or with codecs that encode to
:class:`bytes`.
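
The three kinds of standard codecs mentioned above can be sketched as follows.
This is only an illustration: it assumes the ``rot_13`` and ``hex`` codecs are
available, as they are in a standard CPython build, and the sample strings are
arbitrary.

.. code-block:: python

   import codecs

   # Text encoding: str -> bytes, and back again.
   data = codecs.encode("Gr\u00fc\u00dfe", "utf-8")
   text = codecs.decode(data, "utf-8")

   # Text-to-text codec: str -> str.
   scrambled = codecs.encode("Hello", "rot_13")

   # Bytes-to-bytes codec: bytes -> bytes.
   hexed = codecs.encode(b"\x00\xff", "hex")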

The module defines the following functions for encoding and decoding with
any codec:

.. function:: encode(obj, encoding='utf-8', errors='strict')

   Encodes *obj* using the codec registered for *encoding*.

   *Errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'`` meaning that encoding errors raise
   :exc:`ValueError` (or a more codec specific subclass, such as
   :exc:`UnicodeEncodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

.. function:: decode(obj, encoding='utf-8', errors='strict')

   Decodes *obj* using the codec registered for *encoding*.

   *Errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'`` meaning that decoding errors raise
   :exc:`ValueError` (or a more codec specific subclass, such as
   :exc:`UnicodeDecodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

The full details for each codec can also be looked up directly:

.. function:: lookup(encoding)

   Looks up the codec info in the Python codec registry and returns a
   :class:`CodecInfo` object as defined below.
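
   A minimal sketch of inspecting the returned object. The alias ``"UTF8"`` is
   only an illustrative input; any registered encoding name could be used.

   .. code-block:: python

      import codecs

      info = codecs.lookup("UTF8")
      name = info.name                        # canonical name of the codec
      encoded, consumed = info.encode("abc")  # stateless encoder function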

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.

.. class:: CodecInfo(encode, decode, streamreader=None, streamwriter=None, incrementalencoder=None, incrementaldecoder=None, name=None)

   Codec details when looking up the codec registry. The constructor
   arguments are stored in attributes of the same name:


   .. attribute:: name

      The name of the encoding.


   .. attribute:: encode
                  decode

      The stateless encoding and decoding functions. These must be
      functions or methods which have the same interface as
      the :meth:`~Codec.encode` and :meth:`~Codec.decode` methods of Codec
      instances (see :ref:`Codec Interface <codec-objects>`).
      The functions or methods are expected to work in a stateless mode.


   .. attribute:: incrementalencoder
                  incrementaldecoder

      Incremental encoder and decoder classes or factory functions.
      These have to provide the interface defined by the base classes
      :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
      respectively. Incremental codecs can maintain state.


   .. attribute:: streamwriter
                  streamreader

      Stream writer and reader classes or factory functions. These have to
      provide the interface defined by the base classes
      :class:`StreamWriter` and :class:`StreamReader`, respectively.
      Stream codecs can maintain state.

To simplify access to the various codec components, the module provides
these additional functions which use :func:`lookup` for the codec lookup:

.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.


.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.


.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its :class:`StreamReader`
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its :class:`StreamWriter`
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.

Custom codecs are made available by registering a suitable codec search
function:

.. function:: register(search_function)

   Register a codec search function.
   Search functions are expected to take one
   argument, being the encoding name in all lower case letters with hyphens
   and spaces converted to underscores, and return a :class:`CodecInfo` object.
   In case a search function cannot find a given encoding, it should return
   ``None``.

   .. versionchanged:: 3.9
      Hyphens and spaces are converted to underscore.


.. function:: unregister(search_function)

   Unregister a codec search function and clear the registry's cache.
   If the search function is not registered, do nothing.

   .. versionadded:: 3.10


While the builtin :func:`open` and the associated :mod:`io` module are the
recommended approach for working with encoded text files, this module
provides additional utility functions and classes that allow the use of a
wider range of codecs when working with binary files:

.. function:: open(filename, mode='r', encoding=None, errors='strict', buffering=-1)

   Open an encoded file using the given *mode* and return an instance of
   :class:`StreamReaderWriter`, providing transparent encoding/decoding.
   The default file mode is ``'r'``, meaning to open the file in read mode.

   .. note::

      If *encoding* is not ``None``, then the
      underlying encoded files are always opened in binary mode.
      No automatic conversion of ``'\n'`` is done on reading and writing.
      The *mode* argument may be any binary mode acceptable to the built-in
      :func:`open` function; the ``'b'`` is automatically added.

   *encoding* specifies the encoding which is to be used for the file.
   Any encoding that encodes to and decodes from bytes is allowed, and
   the data types supported by the file methods depend on the codec used.

   *errors* may be given to define the error handling. It defaults to ``'strict'``
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function.
   It defaults to -1 which means that the default buffer size will be used.

   .. versionchanged:: 3.11
      The ``'U'`` mode has been removed.


.. function:: EncodedFile(file, data_encoding, file_encoding=None, errors='strict')

   Return a :class:`StreamRecoder` instance, a wrapped version of *file*
   which provides transparent transcoding. The original file is closed
   when the wrapped version is closed.

   Data written to the wrapped file is decoded according to the given
   *data_encoding* and then written to the original file as bytes using
   *file_encoding*. Bytes read from the original file are decoded
   according to *file_encoding*, and the result is encoded
   using *data_encoding*.

   If *file_encoding* is not given, it defaults to *data_encoding*.

   *errors* may be given to define the error handling. It defaults to
   ``'strict'``, which causes :exc:`ValueError` to be raised in case an encoding
   error occurs.


.. function:: iterencode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental encoder to iteratively encode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental encoder.

   This function requires that the codec accept text :class:`str` objects
   to encode. Therefore it does not support bytes-to-bytes encoders such as
   ``base64_codec``.


.. function:: iterdecode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental decoder to iteratively decode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental decoder.

   This function requires that the codec accept :class:`bytes` objects
   to decode. Therefore it does not support text-to-text encoders such as
   ``rot_13``, although ``rot_13`` may be used equivalently with
   :func:`iterencode`.


The module also provides the following constants which are useful for reading
and writing to platform dependent files:


.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various byte sequences,
   being Unicode byte order marks (BOMs) for several encodings. They are
   used in UTF-16 and UTF-32 data streams to indicate the byte order used,
   and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.
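
The incremental helpers and the BOM constants described above can be sketched
together. This is a minimal illustration, not part of the module:
``strip_utf8_bom`` is a hypothetical helper name, and the chunk boundaries are
chosen on purpose to split a multi-byte character.

.. code-block:: python

   import codecs
   import sys

   # "é" takes two bytes in UTF-8; the chunks split it deliberately.
   chunks = [b"caf", b"\xc3", b"\xa9"]
   text = "".join(codecs.iterdecode(chunks, "utf-8"))

   # BOM_UTF16 follows the platform's native byte order.
   if sys.byteorder == "little":
       native_bom = codecs.BOM_UTF16_LE
   else:
       native_bom = codecs.BOM_UTF16_BE


   def strip_utf8_bom(data: bytes) -> bytes:
       """Hypothetical helper: drop a leading UTF-8 signature if present."""
       if data.startswith(codecs.BOM_UTF8):
           return data[len(codecs.BOM_UTF8):]
       return data

Decoding each chunk separately with :func:`decode` would fail on the second
chunk, because a lone ``b"\xc3"`` is not valid UTF-8; the incremental decoder
buffers it until the next chunk arrives.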


.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interfaces for working with codec objects, and can also be used as the basis
for custom codec implementations.

Each codec has to define four interfaces to make it usable as codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream reader and writers typically reuse the stateless encoder/decoder to
implement the file protocols. Codec authors also need to define how the
codec will handle encoding and decoding errors.


.. _surrogateescape:
.. _error-handlers:

Error Handlers
^^^^^^^^^^^^^^

To simplify and standardize error handling, codecs may implement different
error handling schemes by accepting the *errors* string argument:

   >>> 'German ß, ♬'.encode(encoding='ascii', errors='backslashreplace')
   b'German \\xdf, \\u266c'
   >>> 'German ß, ♬'.encode(encoding='ascii', errors='xmlcharrefreplace')
   b'German &#223;, &#9836;'

.. index::
   pair: strict; error handler's name
   pair: ignore; error handler's name
   pair: replace; error handler's name
   pair: backslashreplace; error handler's name
   pair: surrogateescape; error handler's name
   single: ? (question mark); replacement character
   single: \ (backslash); escape sequence
   single: \x; escape sequence
   single: \u; escape sequence
   single: \U; escape sequence

The following error handlers can be used with all Python
:ref:`standard-encodings` codecs:

.. tabularcolumns:: |l|L|

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass),    |
|                         | this is the default. Implemented in           |
|                         | :func:`strict_errors`.                        |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the malformed data and continue without|
|                         | further notice. Implemented in                |
|                         | :func:`ignore_errors`.                        |
+-------------------------+-----------------------------------------------+
| ``'replace'``           | Replace with a replacement marker. On         |
|                         | encoding, use ``?`` (ASCII character). On     |
|                         | decoding, use ``�`` (U+FFFD, the official     |
|                         | REPLACEMENT CHARACTER). Implemented in        |
|                         | :func:`replace_errors`.                       |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences.    |
|                         | On encoding, use hexadecimal form of Unicode  |
|                         | code point with formats ``\xhh`` ``\uxxxx``   |
|                         | ``\Uxxxxxxxx``. On decoding, use hexadecimal  |
|                         | form of byte value with format ``\xhh``.      |
|                         | Implemented in                                |
|                         | :func:`backslashreplace_errors`.              |
+-------------------------+-----------------------------------------------+
| ``'surrogateescape'``   | On decoding, replace byte with individual     |
|                         | surrogate code ranging from ``U+DC80`` to     |
|                         | ``U+DCFF``. This code will then be turned     |
|                         | back into the same byte when the              |
|                         | ``'surrogateescape'`` error handler is used   |
|                         | when encoding the data. (See :pep:`383` for   |
|                         | more.)                                        |
+-------------------------+-----------------------------------------------+

.. index::
   pair: xmlcharrefreplace; error handler's name
   pair: namereplace; error handler's name
   single: \N; escape sequence

The following error handlers are only applicable to encoding (within
:term:`text encodings <text encoding>`):

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'xmlcharrefreplace'`` | Replace with XML/HTML numeric character       |
|                         | reference, which is a decimal form of Unicode |
|                         | code point with format ``&#num;``. Implemented|
|                         | in :func:`xmlcharrefreplace_errors`.          |
+-------------------------+-----------------------------------------------+
| ``'namereplace'``       | Replace with ``\N{...}`` escape sequences,    |
|                         | what appears in the braces is the Name        |
|                         | property from Unicode Character Database.     |
|                         | Implemented in :func:`namereplace_errors`.    |
+-------------------------+-----------------------------------------------+

.. index::
   pair: surrogatepass; error handler's name

In addition, the following error handler is specific to the given codecs:

+-------------------+------------------------+-------------------------------------------+
| Value             | Codecs                 | Meaning                                   |
+===================+========================+===========================================+
|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding surrogate code|
|                   | utf-16-be, utf-16-le,  | point (``U+D800`` - ``U+DFFF``) as normal |
|                   | utf-32-be, utf-32-le   | code point. Otherwise these codecs treat  |
|                   |                        | the presence of surrogate code point in   |
|                   |                        | :class:`str` as an error.                 |
+-------------------+------------------------+-------------------------------------------+

.. versionadded:: 3.1
   The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers.

.. versionchanged:: 3.4
   The ``'surrogatepass'`` error handler now works with utf-16\* and utf-32\*
   codecs.

.. versionadded:: 3.5
   The ``'namereplace'`` error handler.

.. versionchanged:: 3.5
   The ``'backslashreplace'`` error handler now works with decoding and
   translating.
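
The handlers listed in the tables above can be compared side by side. The
following is a small sketch using arbitrary sample data; it also shows the
``'surrogateescape'`` round trip of a byte that is not valid UTF-8.

.. code-block:: python

   text = "Gr\u00fc\u00dfe \u266c"  # 'Grüße ♬'

   ignored = text.encode("ascii", errors="ignore")
   replaced = text.encode("ascii", errors="replace")
   named = text.encode("ascii", errors="namereplace")

   # surrogateescape: an undecodable byte becomes a lone surrogate on
   # decoding and is turned back into the same byte on encoding.
   raw = b"abc\xff"
   decoded = raw.decode("utf-8", errors="surrogateescape")
   restored = decoded.encode("utf-8", errors="surrogateescape")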

The set of allowed values can be extended by registering a new named error
handler:

.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   The *error_handler* argument will be called during encoding and decoding
   in case of an error, when *name* is specified as the errors parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The
   error handler must either raise this or a different exception, or return a
   tuple with a replacement for the unencodable part of the input and a position
   where encoding should continue. The replacement may be either :class:`str` or
   :class:`bytes`. If the replacement is bytes, the encoder will simply copy
   them into the output buffer. If the replacement is a string, the encoder will
   encode the replacement. Encoding continues on original input at the
   specified position. Negative position values will be treated as being
   relative to the end of the input string. If the resulting position is out of
   bound an :exc:`IndexError` will be raised.
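
   The encoding case can be sketched as follows. The handler name
   ``"underscore"`` and the function ``underscore_errors`` are hypothetical,
   chosen only for this illustration.

   .. code-block:: python

      import codecs


      def underscore_errors(exc):
          """Hypothetical handler: replace each unencodable character with '_'."""
          if not isinstance(exc, UnicodeEncodeError):
              # This sketch only supports encoding; re-raise anything else.
              raise exc
          replacement = "_" * (exc.end - exc.start)
          # Return the replacement and the position where encoding resumes.
          return replacement, exc.end


      codecs.register_error("underscore", underscore_errors)
      result = "na\u00efve caf\u00e9".encode("ascii", errors="underscore")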

   Decoding and translating works similarly, except :exc:`UnicodeDecodeError` or
   :exc:`UnicodeTranslateError` will be passed to the handler and that the
   replacement from the error handler will be put into the output directly.


Previously registered error handlers (including the standard error handlers)
can be looked up by name:

.. function:: lookup_error(name)

   Return the error handler previously registered under the name *name*.

   Raises a :exc:`LookupError` in case the handler cannot be found.

The following standard error handlers are also made available as module level
functions:

.. function:: strict_errors(exception)

   Implements the ``'strict'`` error handling.

   Each encoding or decoding error raises a :exc:`UnicodeError`.


.. function:: ignore_errors(exception)

   Implements the ``'ignore'`` error handling.

   Malformed data is ignored; encoding or decoding is continued without
   further notice.


.. function:: replace_errors(exception)

   Implements the ``'replace'`` error handling.

   Substitutes ``?`` (ASCII character) for encoding errors or ``�`` (U+FFFD,
   the official REPLACEMENT CHARACTER) for decoding errors.


.. function:: backslashreplace_errors(exception)

   Implements the ``'backslashreplace'`` error handling.

   Malformed data is replaced by a backslashed escape sequence.
   On encoding, use the hexadecimal form of Unicode code point with formats
   ``\xhh`` ``\uxxxx`` ``\Uxxxxxxxx``. On decoding, use the hexadecimal form of
   byte value with format ``\xhh``.

   .. versionchanged:: 3.5
      Works with decoding and translating.


.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``'xmlcharrefreplace'`` error handling (for encoding within
   :term:`text encoding` only).

   The unencodable character is replaced by an appropriate XML/HTML numeric
   character reference, which is a decimal form of Unicode code point with
   format ``&#num;``.


.. function:: namereplace_errors(exception)

   Implements the ``'namereplace'`` error handling (for encoding within
   :term:`text encoding` only).

   The unencodable character is replaced by a ``\N{...}`` escape sequence.
   The
   set of characters that appear in the braces is the Name property from
   Unicode Character Database. For example, the German lowercase letter ``'ß'``
   will be converted to byte sequence ``\N{LATIN SMALL LETTER SHARP S}``.

   .. versionadded:: 3.5


.. _codec-objects:

Stateless Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The base :class:`Codec` class defines these methods which also define the
function interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input, errors='strict')

   Encodes the object *input* and returns a tuple (output object, length consumed).
   For instance, :term:`text encoding` converts
   a string object to a bytes object using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.
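
   The return convention can be sketched with the stateless encoder function
   obtained from :func:`getencoder`, which follows the same interface. The
   sample input is arbitrary.

   .. code-block:: python

      import codecs

      encode = codecs.getencoder("utf-8")

      # "length consumed" counts input items (characters), not output bytes.
      output, consumed = encode("h\u00e9")

      # Zero length input yields an empty object of the output type.
      empty_output, empty_consumed = encode("")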


.. method:: Codec.decode(input, errors='strict')

   Decodes the object *input* and returns a tuple (output object, length
   consumed). For instance, for a :term:`text encoding`, decoding converts
   a bytes object encoded using a particular
   character set encoding to a string object.

   For text encodings and bytes-to-bytes codecs,
   *input* must be a bytes object or one which provides the read-only
   buffer interface -- for example, buffer objects and memory mapped files.

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.


Incremental Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding.
Encoding/decoding the input isn't done with one call to the stateless
encoder/decoder function, but with multiple calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
the incremental encoder/decoder. The incremental encoder/decoder keeps track of
the encoding/decoding process during method calls.

The joined output of calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
the same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.


.. _incremental-encoder-objects:

IncrementalEncoder Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`IncrementalEncoder` class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalEncoder(errors='strict')

   Constructor for an :class:`IncrementalEncoder` instance.

   All incremental encoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalEncoder` may implement different error handling schemes
   by providing the *errors* keyword argument. See :ref:`error-handlers` for
   possible values.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalEncoder`
   object.


   .. method:: encode(object, final=False)

      Encodes *object* (taking the current state of the encoder into account)
      and returns the resulting encoded object. If this is the last call to
      :meth:`encode` *final* must be true (the default is false).


   .. method:: reset()

      Reset the encoder to the initial state. The output is discarded: call
      ``.encode(object, final=True)``, passing an empty byte or text string
      if necessary, to reset the encoder and to get the output.


   .. method:: getstate()

      Return the current state of the encoder which must be an integer. The
      implementation should make sure that ``0`` is the most common
      state.
      (States that are more complicated than integers can be converted
      into an integer by marshaling/pickling the state and encoding the bytes
      of the resulting string into an integer.)


   .. method:: setstate(state)

      Set the state of the encoder to *state*. *state* must be an encoder state
      returned by :meth:`getstate`.


.. _incremental-decoder-objects:

IncrementalDecoder Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`IncrementalDecoder` class is used for decoding an input in multiple
steps. It defines the following methods which every incremental decoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalDecoder(errors='strict')

   Constructor for an :class:`IncrementalDecoder` instance.

   All incremental decoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalDecoder` may implement different error handling schemes
   by providing the *errors* keyword argument. See :ref:`error-handlers` for
   possible values.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalDecoder`
   object.


   .. method:: decode(object, final=False)

      Decodes *object* (taking the current state of the decoder into account)
      and returns the resulting decoded object. If this is the last call to
      :meth:`decode` *final* must be true (the default is false). If *final* is
      true the decoder must decode the input completely and must flush all
      buffers. If this isn't possible (e.g. because of incomplete byte sequences
      at the end of the input) it must initiate error handling just like in the
      stateless case (which might raise an exception).


   .. method:: reset()

      Reset the decoder to the initial state.


   .. method:: getstate()

      Return the current state of the decoder. This must be a tuple with two
      items, the first must be the buffer containing the still undecoded
      input. The second must be an integer and can be additional state
      info. (The implementation should make sure that ``0`` is the most common
      additional state info.)
      If this additional state info is ``0`` it must be
      possible to set the decoder to the state which has no input buffered and
      ``0`` as the additional state info, so that feeding the previously
      buffered input to the decoder returns it to the previous state without
      producing any output. (Additional state info that is more complicated than
      integers can be converted into an integer by marshaling/pickling the info
      and encoding the bytes of the resulting string into an integer.)


   .. method:: setstate(state)

      Set the state of the decoder to *state*. *state* must be a decoder state
      returned by :meth:`getstate`.


Stream Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^


The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
working interfaces which can be used to implement new encoding submodules very
easily. See :mod:`encodings.utf_8` for an example of how this is done.


.. _stream-writer-objects:

StreamWriter Objects
~~~~~~~~~~~~~~~~~~~~

The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
following methods which every stream writer must define in order to be
compatible with the Python codec registry.
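For orientation, the following minimal sketch shows the stream interfaces in
use rather than how to implement them. It obtains the UTF-8 stream classes
through :func:`getwriter` and :func:`getreader` and wraps in-memory
:class:`io.BytesIO` objects; the sample text and variable names are assumptions
chosen for the example.

```python
import codecs
import io

# Wrap a binary stream with the UTF-8 stream writer.
raw = io.BytesIO()
Writer = codecs.getwriter("utf-8")
writer = Writer(raw, errors="strict")
writer.write("Gr\u00fc\u00dfe\n")
writer.writelines(["line two\n", "line three\n"])

# The underlying stream now holds the encoded bytes.
data = raw.getvalue()

# Wrap a second binary stream with the UTF-8 stream reader and decode it.
Reader = codecs.getreader("utf-8")
reader = Reader(io.BytesIO(data))
lines = reader.readlines()
print(lines)
```

Application code that only needs to read or write text files will usually find
the built-in :func:`open` more convenient; the stream classes matter mostly
when implementing or wrapping codecs.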


.. class:: StreamWriter(stream, errors='strict')

   Constructor for a :class:`StreamWriter` instance.

   All stream writers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   The *stream* argument must be a file-like object open for writing
   text or binary data, as appropriate for the specific codec.

   The :class:`StreamWriter` may implement different error handling schemes by
   providing the *errors* keyword argument. See :ref:`error-handlers` for
   the standard error handlers the underlying stream codec may support.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamWriter` object.

   .. method:: write(object)

      Writes the object's contents encoded to the stream.


   .. method:: writelines(list)

      Writes the concatenated iterable of strings to the stream (possibly by reusing
      the :meth:`write` method). Infinite or
      very large iterables are not supported.
      The standard bytes-to-bytes codecs
      do not support this method.


   .. method:: reset()

      Resets the codec buffers used for keeping internal state.

      Calling this method should ensure that the data on the output is put into
      a clean state that allows appending of new fresh data without having to
      rescan the whole stream to recover state.


In addition to the above methods, the :class:`StreamWriter` must also inherit
all other methods and attributes from the underlying stream.


.. _stream-reader-objects:

StreamReader Objects
~~~~~~~~~~~~~~~~~~~~

The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
following methods which every stream reader must define in order to be
compatible with the Python codec registry.


.. class:: StreamReader(stream, errors='strict')

   Constructor for a :class:`StreamReader` instance.

   All stream readers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   The *stream* argument must be a file-like object open for reading
   text or binary data, as appropriate for the specific codec.

   The :class:`StreamReader` may implement different error handling schemes by
   providing the *errors* keyword argument. See :ref:`error-handlers` for
   the standard error handlers the underlying stream codec may support.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamReader` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: read(size=-1, chars=-1, firstline=False)

      Decodes data from the stream and returns the resulting object.

      The *chars* argument indicates the number of decoded
      code points or bytes to return. The :meth:`read` method will
      never return more data than requested, but it might return less,
      if there is not enough available.

      The *size* argument indicates the approximate maximum
      number of encoded bytes or code points to read
      for decoding. The decoder can modify this setting as
      appropriate.
      The default value -1 indicates to read and decode as much as
      possible. This parameter is intended to
      prevent having to decode huge files in one step.

      The *firstline* flag indicates that
      it would be sufficient to only return the first
      line, if there are decoding errors on later lines.

      The method should use a greedy read strategy meaning that it should read
      as much data as is allowed within the definition of the encoding and the
      given size, e.g. if optional encoding endings or state markers are
      available on the stream, these should be read too.


   .. method:: readline(size=None, keepends=True)

      Read one line from the input stream and return the decoded data.

      *size*, if given, is passed as size argument to the stream's
      :meth:`read` method.

      If *keepends* is false line-endings will be stripped from the lines
      returned.


   .. method:: readlines(sizehint=None, keepends=True)

      Read all lines available on the input stream and return them as a list of
      lines.

      Line-endings are implemented using the codec's :meth:`decode` method and
      are included in the list entries if *keepends* is true.

      *sizehint*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.


   .. method:: reset()

      Resets the codec buffers used for keeping internal state.

      Note that no stream repositioning should take place. This method is
      primarily intended to be able to recover from decoding errors.


In addition to the above methods, the :class:`StreamReader` must also inherit
all other methods and attributes from the underlying stream.

.. _stream-reader-writer:

StreamReaderWriter Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`StreamReaderWriter` is a convenience class that allows wrapping
streams which work in both read and write modes.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamReaderWriter(stream, Reader, Writer, errors='strict')

   Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
   object. *Reader* and *Writer* must be factory functions or classes providing the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively.
   Error handling is done in the same way as defined for the stream readers
   and writers.

:class:`StreamReaderWriter` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.


.. _stream-recoder-objects:

StreamRecoder Objects
~~~~~~~~~~~~~~~~~~~~~

The :class:`StreamRecoder` translates data from one encoding to another,
which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors='strict')

   Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
   *encode* and *decode* work on the frontend (the data visible to
   code calling :meth:`read` and :meth:`write`), while *Reader* and *Writer*
   work on the backend (the data in *stream*).

   You can use these objects to do transparent transcodings, e.g., from Latin-1
   to UTF-8 and back.

   The *stream* argument must be a file-like object.

   The *encode* and *decode* arguments must
   adhere to the :class:`Codec` interface. *Reader* and
   *Writer* must be factory functions or classes providing objects of the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively.

   Error handling is done in the same way as defined for the stream readers and
   writers.


:class:`StreamRecoder` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.


.. _encodings-overview:

Encodings and Unicode
---------------------

Strings are stored internally as sequences of code points in
range ``U+0000``--``U+10FFFF``. (See :pep:`393` for
more details about the implementation.)
Once a string object is used outside of CPU and memory, endianness
and how these arrays are stored as bytes become an issue. As with other
codecs, serialising a string into a sequence of bytes is known as *encoding*,
and recreating the string from the sequence of bytes is known as *decoding*.
There are a variety of different text serialisation codecs, which are
collectively referred to as :term:`text encodings <text encoding>`.
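A small sketch may make the distinction concrete: the same string serialises to
different byte sequences depending on the text encoding, and decoding with the
matching codec recreates the string. The sample word and variable names are
illustrative assumptions.

```python
text = "na\u00efve"  # five code points, one of them above U+007F

# Encoding: the same string gives a different byte sequence per codec.
utf8_bytes = text.encode("utf-8")        # 6 bytes, U+00EF takes two
latin1_bytes = text.encode("latin-1")    # 5 bytes, one per code point
utf16_bytes = text.encode("utf-16-le")   # 10 bytes, two per code point

# Decoding with the matching codec recreates the original string.
round_trips = [
    utf8_bytes.decode("utf-8"),
    latin1_bytes.decode("latin-1"),
    utf16_bytes.decode("utf-16-le"),
]
print(len(utf8_bytes), len(latin1_bytes), len(utf16_bytes))
```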

The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
the code points 0--255 to the bytes ``0x0``--``0xff``, which means that a string
object that contains code points above ``U+00FF`` can't be encoded with this
codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
like the following (although the details of the error message may differ):
``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
position 3: ordinal not in range(256)``.

There's another group of encodings (the so called charmap encodings) that choose
a different subset of all Unicode code points and how these code points are
mapped to the bytes ``0x0``--``0xff``. To see how this is done simply open
e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
Windows). There's a string constant with 256 characters that shows you which
character is mapped to which byte value.

All of these encodings can only encode 256 of the 1114112 code points
defined in Unicode. A simple and straightforward way that can store each Unicode
code point, is to store each code point as four consecutive bytes. There are two
possibilities: store the bytes in big endian or in little endian order. These
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
disadvantage is that if e.g.
you use ``UTF-32-BE`` on a little endian machine you
will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
problem: bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, then bytes have to be swapped though. To
be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
there's the so called BOM ("Byte Order Mark"). This is the Unicode character
``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
byte sequence. The byte swapped version of this character (``0xFFFE``) is an
illegal character that may not appear in a Unicode text. So when the
first character in a ``UTF-16`` or ``UTF-32`` byte sequence
appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
Unfortunately the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
it's a device to determine the storage layout of the encoded bytes, and vanishes
once the byte sequence has been decoded into a string; as a
``ZERO WIDTH NO-BREAK SPACE`` it's a normal character that will be decoded like
any other.
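The two roles of ``U+FEFF`` can be observed with a short sketch based on the
``utf-16`` family of codecs and the BOM constants of this module. The sample
text and variable names are assumptions for illustration.

```python
import codecs
import sys

text = "hi"

# Codecs with an explicit byte order write no BOM.
big_endian = text.encode("utf-16-be")      # b'\x00h\x00i'
little_endian = text.encode("utf-16-le")   # b'h\x00i\x00'

# The generic codec writes a BOM first, then uses the native byte order.
with_bom = text.encode("utf-16")
native_bom = (
    codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
)

# On decoding, the generic codec uses the BOM to pick the byte order and
# removes it from the result.
from_be = (codecs.BOM_UTF16_BE + big_endian).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + little_endian).decode("utf-16")

# A codec with a fixed byte order keeps a leading U+FEFF as a normal character.
kept = (codecs.BOM_UTF16_LE + little_endian).decode("utf-16-le")
print(repr(from_be), repr(from_le), repr(kept))
```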

There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: marker bits (the most significant bits) and payload bits. The marker bits
are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
encoded like this (with x being payload bits, which when concatenated give the
Unicode character):

+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+

The least significant bit of the Unicode character is the rightmost x bit.
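The bit layout in the table can be checked with a hand-written encoder. The
function below is a hypothetical helper written for this illustration, not part
of the :mod:`codecs` module, and it deliberately ignores the surrogate range
``U+D800``--``U+DFFF``, which the real UTF-8 codec rejects.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one code point by following the marker/payload bit table."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


# Compare against the built-in codec, one sample per row of the table.
samples = [0x41, 0xDF, 0x20AC, 0x1F600]
matches = [utf8_encode_code_point(cp) == chr(cp).encode("utf-8") for cp in samples]
print(matches)
```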

As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in
the decoded string (even if it's the first character) is treated as a
``ZERO WIDTH NO-BREAK SPACE``.

Without external information it's impossible to reliably determine which
encoding was used for encoding a string. Each charmap encoding can
decode any random byte sequence. However that's not possible with UTF-8, as
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
sequences. To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python calls
``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
that any charmap encoded file starts with these byte values (which would e.g.
map to

   | LATIN SMALL LETTER I WITH DIAERESIS
   | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
   | INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
correctly guessed from the byte sequence.
So here the BOM is not used
to determine the byte order used for generating the byte sequence, but as a
signature that helps in guessing the encoding. On encoding, the ``utf-8-sig``
codec writes ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes of the file.
On decoding, ``utf-8-sig`` skips those three bytes if they appear as the first
three bytes in the file. In UTF-8, the use of the BOM is discouraged and
should generally be avoided.


.. _standard-encodings:

Standard Encodings
------------------

Python comes with a number of codecs built-in, either implemented as C functions
or with dictionaries as mapping tables. The following table lists the codecs by
name, together with a few common aliases, and the languages for which the
encoding is likely used. Neither the list of aliases nor the list of languages
is meant to be exhaustive. Note that spelling alternatives that only differ in
case or use a hyphen instead of an underscore are also valid aliases; for
example, ``'utf-8'`` is a valid alias for the ``'utf_8'`` codec.

.. impl-detail::

   Some common encodings can bypass the codecs lookup machinery to
   improve performance. These optimization opportunities are only
   recognized by CPython for a limited set of (case insensitive)
   aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs
   (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and
   the same using underscores instead of dashes. Using alternative
   aliases for these encodings may result in slower execution.

   .. versionchanged:: 3.6
      Optimization opportunity recognized for us-ascii.

Many of the character sets support the same languages. They vary in individual
characters (e.g. whether the EURO SIGN is supported or not), and in the
assignment of characters to code positions. For the European languages in
particular, the following variants typically exist:

* an ISO 8859 codeset

* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
  but replaces control characters with additional graphic characters

* an IBM EBCDIC code page

* an IBM PC code page, which is ASCII compatible

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+-----------------+--------------------------------+--------------------------------+
| Codec           | Aliases                        | Languages                      |
+=================+================================+================================+
| ascii           | 646, us-ascii                  | English                        |
+-----------------+--------------------------------+--------------------------------+
| big5            | big5-tw, csbig5                | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp037           | IBM037, IBM039                 | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp273           | 273, IBM273, csIBM273          | German                         |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.4          |
+-----------------+--------------------------------+--------------------------------+
| cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp437           | 437, IBM437                    | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
|                 | IBM500                         |                                |
+-----------------+--------------------------------+--------------------------------+
| cp720           |                                | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp737           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp775           | IBM775                         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp850           | 850, IBM850                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp852           | 852, IBM852                    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp856           |                                | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp857           | 857, IBM857                    | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp858           | 858, IBM858                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp860           | 860, IBM860                    | Portuguese                     |
+-----------------+--------------------------------+--------------------------------+
| cp861           | 861, CP-IS, IBM861             | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| cp862           | 862, IBM862                    | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp863           | 863, IBM863                    | Canadian                       |
+-----------------+--------------------------------+--------------------------------+
| cp864           | IBM864                         | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp865           | 865, IBM865                    | Danish, Norwegian              |
+-----------------+--------------------------------+--------------------------------+
| cp866           | 866, IBM866                    | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| cp869           | 869, CP-GR, IBM869             | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp874           |                                | Thai                           |
+-----------------+--------------------------------+--------------------------------+
| cp875           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| cp949           | 949, ms949, uhc                | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| cp950           | 950, ms950                     | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp1006          |                                | Urdu                           |
+-----------------+--------------------------------+--------------------------------+
| cp1026          | ibm1026                        | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1125          | 1125, ibm1125, cp866u, ruscii  | Ukrainian                      |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.4          |
+-----------------+--------------------------------+--------------------------------+
| cp1140          | ibm1140                        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1250          | windows-1250                   | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp1252          | windows-1252                   | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1253          | windows-1253                   | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp1254          | windows-1254                   | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1255          | windows-1255                   | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp1256          | windows-1256                   | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp1257          | windows-1257                   | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp1258          | windows-1258                   | Vietnamese                     |
+-----------------+--------------------------------+--------------------------------+
| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jisx0213    | eucjisx0213                    | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_kr          | euckr, korean, ksc5601,        | Korean                         |
|                 | ks_c-5601, ks_c-5601-1987,     |                                |
|                 | ksx1001, ks_x-1001             |                                |
+-----------------+--------------------------------+--------------------------------+
| gb2312          | chinese, csiso58gb231280,      | Simplified Chinese             |
|                 | euc-cn, euccn, eucgb2312-cn,   |                                |
|                 | gb2312-1980, gb2312-80,        |                                |
|                 | iso-ir-58                      |                                |
+-----------------+--------------------------------+--------------------------------+
| gbk             | 936, cp936, ms936              | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| gb18030         | gb18030-2000                   | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
|                 | iso-2022-jp                    |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
|                 |                                | Chinese, Western Europe, Greek |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
|                 | iso-2022-jp-2004               |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
|                 | iso-2022-kr                    |                                |
+-----------------+--------------------------------+--------------------------------+
| latin_1         | iso-8859-1, iso8859-1, 8859,   | Western Europe                 |
|                 | cp819, latin, latin1, L1       |                                |
+-----------------+--------------------------------+--------------------------------+
| iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
+-----------------+--------------------------------+--------------------------------+
| iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| iso8859_6       | iso-8859-6, arabic             | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_11      | iso-8859-11, thai              | Thai languages                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_13      | iso-8859-13, latin7, L7        | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_15      | iso-8859-15, latin9, L9        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_16      | iso-8859-16, latin10, L10      | South-Eastern Europe           |
+-----------------+--------------------------------+--------------------------------+
| johab           | cp1361, ms1361                 | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| koi8_r          |                                | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| koi8_t          |                                | Tajik                          |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.5          |
+-----------------+--------------------------------+--------------------------------+
| koi8_u          |                                | Ukrainian                      |
+-----------------+--------------------------------+--------------------------------+
| kz1048          | kz_1048, strk1048_2002, rk1048 | Kazakh                         |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.5          |
+-----------------+--------------------------------+--------------------------------+
| mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| mac_greek       | macgreek                       | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| mac_iceland     | maciceland                     | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| mac_latin2      | maclatin2, maccentraleurope,   | Central and Eastern Europe     |
|                 | mac_centeuro                   |                                |
+-----------------+--------------------------------+--------------------------------+
| mac_roman       | macroman, macintosh            | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| mac_turkish     | macturkish                     | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
|                 | cyrillic-asian                 |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
|                 | s_jis                          |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
|                 | sjis2004                       |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
|                 | s_jisx0213                     |                                |
+-----------------+--------------------------------+--------------------------------+
| utf_32          | U32, utf32                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_be       | UTF-32BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_le       | UTF-32LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16          | U16, utf16                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_be       | UTF-16BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_le       | UTF-16LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_7           | U7, unicode-1-1-utf-7          | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8           | U8, UTF, utf8, cp65001         | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8_sig       |                                | all languages                  |
+-----------------+--------------------------------+--------------------------------+
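
The alias handling described for this table (case differences, hyphens versus
underscores, and the listed short forms) can be checked with
:func:`codecs.lookup`, which returns a :class:`CodecInfo` whose ``name``
attribute identifies the codec that was found. The following is a minimal
sketch; ``canonical_name`` is a hypothetical helper introduced only for this
example.

```python
import codecs


def canonical_name(spelling):
    """Return the name the codec registry reports for *spelling*.

    Hypothetical helper for illustration; it simply wraps codecs.lookup().
    """
    return codecs.lookup(spelling).name


# Several spellings taken from the utf_8 row of the table above.
spellings = ["utf_8", "utf-8", "UTF-8", "utf8", "U8"]
names = {spelling: canonical_name(spelling) for spelling in spellings}

for spelling, name in names.items():
    print(f"{spelling!r} -> {name!r}")

# All spellings are expected to resolve to one and the same codec.
print(len(set(names.values())) == 1)
```

An unknown spelling makes :func:`codecs.lookup` raise :exc:`LookupError`, so
the same helper can be used to test whether a name is available.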

.. versionchanged:: 3.4
   The utf-16\* and utf-32\* encoders no longer allow surrogate code points
   (``U+D800``--``U+DFFF``) to be encoded.
   The utf-32\* decoders no longer decode
   byte sequences that correspond to surrogate code points.

.. versionchanged:: 3.8
   ``cp65001`` is now an alias to ``utf_8``.


Python Specific Encodings
-------------------------

A number of predefined codecs are specific to Python, so their codec names have
no meaning outside Python. These are listed in the tables below based on the
expected input and output types (note that while text encodings are the most
common use case for codecs, the underlying codec infrastructure supports
arbitrary data transforms rather than just text encodings). For asymmetric
codecs, the stated meaning describes the encoding direction.

Text Encodings
^^^^^^^^^^^^^^

The following codecs provide :class:`str` to :class:`bytes` encoding and
:term:`bytes-like object` to :class:`str` decoding, similar to the Unicode text
encodings.

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Meaning                   |
+====================+=========+===========================+
| idna               |         | Implement :rfc:`3490`,    |
|                    |         | see also                  |
|                    |         | :mod:`encodings.idna`.    |
|                    |         | Only ``errors='strict'``  |
|                    |         | is supported.             |
+--------------------+---------+---------------------------+
| mbcs               | ansi,   | Windows only: Encode the  |
|                    | dbcs    | operand according to the  |
|                    |         | ANSI codepage (CP_ACP).   |
+--------------------+---------+---------------------------+
| oem                |         | Windows only: Encode the  |
|                    |         | operand according to the  |
|                    |         | OEM codepage (CP_OEMCP).  |
|                    |         |                           |
|                    |         | .. versionadded:: 3.6     |
+--------------------+---------+---------------------------+
| palmos             |         | Encoding of PalmOS 3.5.   |
+--------------------+---------+---------------------------+
| punycode           |         | Implement :rfc:`3492`.    |
|                    |         | Stateful codecs are not   |
|                    |         | supported.                |
+--------------------+---------+---------------------------+
| raw_unicode_escape |         | Latin-1 encoding with     |
|                    |         | ``\uXXXX`` and            |
|                    |         | ``\UXXXXXXXX`` for other  |
|                    |         | code points. Existing     |
|                    |         | backslashes are not       |
|                    |         | escaped in any way.       |
|                    |         | It is used in the Python  |
|                    |         | pickle protocol.          |
+--------------------+---------+---------------------------+
| undefined          |         | Raise an exception for    |
|                    |         | all conversions, even     |
|                    |         | empty strings. The error  |
|                    |         | handler is ignored.       |
+--------------------+---------+---------------------------+
| unicode_escape     |         | Encoding suitable as the  |
|                    |         | contents of a Unicode     |
|                    |         | literal in ASCII-encoded  |
|                    |         | Python source code,       |
|                    |         | except that quotes are    |
|                    |         | not escaped. Decode       |
|                    |         | from Latin-1 source code. |
|                    |         | Beware that Python source |
|                    |         | code actually uses UTF-8  |
|                    |         | by default.               |
+--------------------+---------+---------------------------+

.. versionchanged:: 3.8
   "unicode_internal" codec is removed.


.. _binary-transforms:

Binary Transforms
^^^^^^^^^^^^^^^^^

The following codecs provide binary transforms: :term:`bytes-like object`
to :class:`bytes` mappings. They are not supported by :meth:`bytes.decode`
(which only produces :class:`str` output).
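
The distinction drawn above can be illustrated with the ``hex`` alias: the
module-level :func:`codecs.encode` and :func:`codecs.decode` accept a binary
transform, whereas :meth:`bytes.decode` rejects it. This is a small sketch
under that assumption; the variable names are chosen only for the example.

```python
import codecs

data = b"hello"

# bytes -> bytes through the generic codec functions.
hex_form = codecs.encode(data, "hex")
round_trip = codecs.decode(hex_form, "hex")
print(hex_form)    # b'68656c6c6f'
print(round_trip)  # b'hello'

# bytes.decode() only accepts text encodings, so the same codec name
# is refused here with a LookupError.
try:
    hex_form.decode("hex")
    rejected = False
except LookupError as exc:
    rejected = True
    print("rejected:", exc)
```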

.. tabularcolumns:: |l|L|L|L|

+----------------------+------------------+------------------------------+------------------------------+
| Codec                | Aliases          | Meaning                      | Encoder / decoder            |
+======================+==================+==============================+==============================+
| base64_codec [#b64]_ | base64, base_64  | Convert the operand to       | :meth:`base64.encodebytes` / |
|                      |                  | multiline MIME base64 (the   | :meth:`base64.decodebytes`   |
|                      |                  | result always includes a     |                              |
|                      |                  | trailing ``'\n'``).          |                              |
|                      |                  |                              |                              |
|                      |                  | .. versionchanged:: 3.4      |                              |
|                      |                  |    accepts any               |                              |
|                      |                  |    :term:`bytes-like object` |                              |
|                      |                  |    as input for encoding and |                              |
|                      |                  |    decoding                  |                              |
+----------------------+------------------+------------------------------+------------------------------+
| bz2_codec            | bz2              | Compress the operand using   | :meth:`bz2.compress` /       |
|                      |                  | bz2.                         | :meth:`bz2.decompress`       |
+----------------------+------------------+------------------------------+------------------------------+
| hex_codec            | hex              | Convert the operand to       | :meth:`binascii.b2a_hex` /   |
|                      |                  | hexadecimal                  | :meth:`binascii.a2b_hex`     |
|                      |                  | representation, with two     |                              |
|                      |                  | digits per byte.             |                              |
+----------------------+------------------+------------------------------+------------------------------+
| quopri_codec         | quopri,          | Convert the operand to MIME  | :meth:`quopri.encode` with   |
|                      | quotedprintable, | quoted printable.            | ``quotetabs=True`` /         |
|                      | quoted_printable |                              | :meth:`quopri.decode`        |
+----------------------+------------------+------------------------------+------------------------------+
| uu_codec             | uu               | Convert the operand using    | :meth:`uu.encode` /          |
|                      |                  | uuencode.                    | :meth:`uu.decode`            |
+----------------------+------------------+------------------------------+------------------------------+
| zlib_codec           | zip, zlib        | Compress the operand using   | :meth:`zlib.compress` /      |
|                      |                  | zlib.                        | :meth:`zlib.decompress`      |
+----------------------+------------------+------------------------------+------------------------------+

.. [#b64] In addition to :term:`bytes-like objects <bytes-like object>`,
   ``'base64_codec'`` also accepts ASCII-only instances of :class:`str` for
   decoding.

.. versionadded:: 3.2
   Restoration of the binary transforms.

.. versionchanged:: 3.4
   Restoration of the aliases for the binary transforms.


.. _text-transforms:

Text Transforms
^^^^^^^^^^^^^^^

The following codec provides a text transform: a :class:`str` to :class:`str`
mapping. It is not supported by :meth:`str.encode` (which only produces
:class:`bytes` output).

.. tabularcolumns:: |l|l|L|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Meaning                   |
+====================+=========+===========================+
| rot_13             | rot13   | Return the Caesar-cypher  |
|                    |         | encryption of the         |
|                    |         | operand.                  |
+--------------------+---------+---------------------------+

.. versionadded:: 3.2
   Restoration of the ``rot_13`` text transform.

.. versionchanged:: 3.4
   Restoration of the ``rot13`` alias.


:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.

If you need the IDNA 2008 standard from :rfc:`5891` and :rfc:`5895`, use the
third-party `idna module <https://pypi.org/project/idna/>`_.

These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application and, if possible, is
invisible to the user: the application should transparently convert Unicode
domain labels to IDNA on the wire, and convert ACE labels back to Unicode
before presenting them to the user.
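
As a minimal sketch of the conversion described above, the built-in ``idna``
codec can turn the example domain name into its ACE form and back. The variable
names are illustrative only. Decoding returns the nameprepped (lowercased)
label rather than the original spelling.

```python
# Unicode domain name from the text above; "\u00e7" is the letter c with cedilla.
unicode_name = "www.Alliancefran\u00e7aise.nu"

# Encode to the ASCII-compatible encoding (ACE) that is used on the wire.
ace_name = unicode_name.encode("idna")
print(ace_name)      # b'www.xn--alliancefranaise-npb.nu'

# Decode the ACE labels back to Unicode before showing them to a user.
display_name = ace_name.decode("idna")
print(display_name)  # www.alliancefrançaise.nu
```

Labels that are already pure ASCII, such as ``www`` and ``nu`` here, pass
through unchanged.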

Python supports this conversion in several ways: the ``idna`` codec performs
conversion between Unicode and ACE, separating an input string into labels
based on the separator characters defined in :rfc:`section 3.1 of RFC 3490 <3490#section-3.1>`
and converting each label to ACE as required, and conversely separating an input
byte string into labels based on the ``.`` separator and converting any ACE
labels found into Unicode. Furthermore, the :mod:`socket` module
transparently converts Unicode host names to ACE, so that applications need not
be concerned about converting host names themselves when they pass them to the
socket module. On top of that, modules that have host names as function
parameters, such as :mod:`http.client` and :mod:`ftplib`, accept Unicode host
names (:mod:`http.client` then also transparently sends an IDNA hostname in the
:mailheader:`Host` field if it sends that field at all).

When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: applications wishing to present
such host names to the user should decode them to Unicode.

The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters.
The nameprep functions can be used directly if desired.


.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.


:mod:`encodings.mbcs` --- Windows ANSI codepage
-----------------------------------------------

.. module:: encodings.mbcs
   :synopsis: Windows ANSI codepage

This module implements the ANSI codepage (CP_ACP).

.. availability:: Windows.

.. versionchanged:: 3.3
   Support any error handler.

.. versionchanged:: 3.2
   Before 3.2, the *errors* argument was ignored; ``'replace'`` was always used
   to encode, and ``'ignore'`` to decode.


:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald

This module implements a variant of the UTF-8 codec. On encoding, a UTF-8 encoded
BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this
is only done once (on the first write to the byte stream). On decoding, an
optional UTF-8 encoded BOM at the start of the data will be skipped.
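
The behaviour described above can be sketched as follows. This is an
illustration under stated assumptions, not a definitive reference: the sample
text and variable names are made up for the example, and an incremental encoder
obtained from :func:`codecs.getincrementalencoder` stands in for the "stateful
encoder" mentioned in the text.

```python
import codecs

text = "caf\u00e9"

# Encoding prepends the UTF-8 encoded BOM (codecs.BOM_UTF8).
with_bom = text.encode("utf-8-sig")
has_bom = with_bom.startswith(codecs.BOM_UTF8)

# Decoding skips a leading BOM if present and also accepts data without one.
decoded_with_bom = with_bom.decode("utf-8-sig")
decoded_without_bom = text.encode("utf-8").decode("utf-8-sig")

# The plain utf-8 codec, by contrast, keeps the BOM as a U+FEFF character.
decoded_plain = with_bom.decode("utf-8")

# An incremental (stateful) encoder emits the BOM only with the first chunk.
encoder = codecs.getincrementalencoder("utf-8-sig")()
chunks = [encoder.encode("ab"), encoder.encode("cd", final=True)]
```

Because the BOM is optional on decoding, ``utf-8-sig`` is a convenient choice
for reading files that may or may not start with a signature.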