:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.

.. moduleauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>

**Source code:** :source:`Lib/codecs.py`

.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

--------------

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry, which
manages the codec and error handling lookup process. Most standard codecs
are :term:`text encodings <text encoding>`, which encode text to bytes (and
decode bytes to text), but there are also codecs provided that encode text to
text, and bytes to bytes. Custom codecs may encode and decode between arbitrary
types, but some module features are restricted to
:term:`text encodings <text encoding>` or to codecs that encode to
:class:`bytes`.

The module defines the following functions for encoding and decoding with
any codec:

.. function:: encode(obj, encoding='utf-8', errors='strict')

   Encodes *obj* using the codec registered for *encoding*.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'``, meaning that encoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeEncodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

.. function:: decode(obj, encoding='utf-8', errors='strict')

   Decodes *obj* using the codec registered for *encoding*.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'``, meaning that decoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeDecodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.
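
As a minimal sketch of the two functions above, the following calls use one
text encoding, one bytes-to-bytes codec (``hex_codec``) and one text-to-text
codec (``rot_13``). The sample strings are arbitrary and chosen only for
illustration.

.. code-block:: python

   import codecs

   # Text encoding: str -> bytes, and back again.
   data = codecs.encode('Grüße', 'utf-8')
   text = codecs.decode(data, 'utf-8')

   # Bytes-to-bytes codec.
   hexed = codecs.encode(b'abc', 'hex_codec')
   roundtrip = codecs.decode(hexed, 'hex_codec')

   # Text-to-text codec.
   rot = codecs.encode('Hello', 'rot_13')

   # With the default 'strict' handler a failure raises a ValueError subclass.
   try:
       codecs.encode('ß', 'ascii')
   except UnicodeEncodeError as exc:
       failure = exc

Catching :exc:`ValueError` instead of :exc:`UnicodeEncodeError` would also
work here, since the latter is a subclass of the former.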

The full details for each codec can also be looked up directly:

.. function:: lookup(encoding)

   Looks up the codec info in the Python codec registry and returns a
   :class:`CodecInfo` object as defined below.

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.

.. class:: CodecInfo(encode, decode, streamreader=None, streamwriter=None, incrementalencoder=None, incrementaldecoder=None, name=None)

   Codec details returned when looking up the codec registry. The constructor
   arguments are stored in attributes of the same name:


   .. attribute:: name

      The name of the encoding.


   .. attribute:: encode
                  decode

      The stateless encoding and decoding functions. These must be
      functions or methods which have the same interface as
      the :meth:`~Codec.encode` and :meth:`~Codec.decode` methods of Codec
      instances (see :ref:`Codec Interface <codec-objects>`).
      The functions or methods are expected to work in a stateless mode.


   .. attribute:: incrementalencoder
                  incrementaldecoder

      Incremental encoder and decoder classes or factory functions.
      These have to provide the interface defined by the base classes
      :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
      respectively. Incremental codecs can maintain state.


   .. attribute:: streamwriter
                  streamreader

      Stream writer and reader classes or factory functions. These have to
      provide the interface defined by the base classes
      :class:`StreamWriter` and :class:`StreamReader`, respectively.
      Stream codecs can maintain state.
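
The lookup and the attributes described above can be explored with a short
sketch like the following. The codec names are ordinary standard encodings,
except ``'no-such-codec-example'``, which is a made-up name used to trigger
the failure case.

.. code-block:: python

   import codecs

   info = codecs.lookup('UTF8')        # aliases resolve to the same codec
   canonical = info.name

   # The stateless functions return a tuple (output, length consumed).
   encoded, consumed = info.encode('héllo')

   # An incremental decoder keeps state between calls, so a multi-byte
   # sequence may be split across two chunks.
   decoder = info.incrementaldecoder()
   first = decoder.decode(b'h\xc3')    # the lone lead byte is buffered
   second = decoder.decode(b'\xa9')

   # Unknown names raise LookupError.
   try:
       codecs.lookup('no-such-codec-example')
       missing = False
   except LookupError:
       missing = True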

To simplify access to the various codec components, the module provides
these additional functions which use :func:`lookup` for the codec lookup:

.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.


.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.


.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its :class:`StreamReader`
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its :class:`StreamWriter`
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.
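
A brief sketch of how these helpers might be used follows. It wraps in-memory
:class:`io.BytesIO` streams rather than real files, which is an assumption
made to keep the example self-contained.

.. code-block:: python

   import codecs
   import io

   # Stateless encoder function: returns (bytes, number of characters consumed).
   encode = codecs.getencoder('latin-1')
   payload, length = encode('café')

   # The returned reader class is instantiated around a binary stream.
   source = io.BytesIO('naïve\n'.encode('utf-8'))
   reader = codecs.getreader('utf-8')(source)
   line = reader.readline()

   # The returned writer class encodes text on its way into a binary stream.
   sink = io.BytesIO()
   writer = codecs.getwriter('utf-16-le')(sink)
   writer.write('hi')
   written = sink.getvalue()

   # The incremental variants return a class (or factory) to instantiate.
   encoder = codecs.getincrementalencoder('utf-8')()
   piece = encoder.encode('é')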

Custom codecs are made available by registering a suitable codec search
function:

.. function:: register(search_function)

   Register a codec search function. Search functions are expected to take one
   argument, being the encoding name in all lower case letters with hyphens
   and spaces converted to underscores, and return a :class:`CodecInfo` object.
   In case a search function cannot find a given encoding, it should return
   ``None``.

   .. versionchanged:: 3.9
      Hyphens and spaces are converted to underscores.


.. function:: unregister(search_function)

   Unregister a codec search function and clear the registry's cache.
   If the search function is not registered, do nothing.

   .. versionadded:: 3.10
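
The registration mechanism can be sketched as follows. The search function
``search_example`` and the codec name ``example_ascii`` are hypothetical and
exist only for this illustration; the codec simply reuses the components of
the standard ``ascii`` codec.

.. code-block:: python

   import codecs

   def search_example(name):
       # The registry passes the name lower-cased. Hyphens are normalized
       # defensively here because they are only converted since Python 3.9.
       if name.replace('-', '_') != 'example_ascii':
           return None          # not ours: let other search functions try
       base = codecs.lookup('ascii')
       return codecs.CodecInfo(
           encode=base.encode,
           decode=base.decode,
           streamreader=base.streamreader,
           streamwriter=base.streamwriter,
           incrementalencoder=base.incrementalencoder,
           incrementaldecoder=base.incrementaldecoder,
           name='example-ascii',
       )

   codecs.register(search_example)
   encoded = codecs.encode('abc', 'Example-ASCII')

   # unregister() is only available on Python 3.10 and later.
   if hasattr(codecs, 'unregister'):
       codecs.unregister(search_example)

Returning ``None`` for unknown names matters: it lets the registry continue
with the remaining search functions instead of failing early.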


While the builtin :func:`open` and the associated :mod:`io` module are the
recommended approach for working with encoded text files, this module
provides additional utility functions and classes that allow the use of a
wider range of codecs when working with binary files:

.. function:: open(filename, mode='r', encoding=None, errors='strict', buffering=-1)

   Open an encoded file using the given *mode* and return an instance of
   :class:`StreamReaderWriter`, providing transparent encoding/decoding.
   The default file mode is ``'r'``, meaning to open the file in read mode.

   .. note::

      If *encoding* is not ``None``, then the
      underlying encoded files are always opened in binary mode.
      No automatic conversion of ``'\n'`` is done on reading and writing.
      The *mode* argument may be any binary mode acceptable to the built-in
      :func:`open` function; the ``'b'`` is automatically added.

   *encoding* specifies the encoding which is to be used for the file.
   Any encoding that encodes to and decodes from bytes is allowed, and
   the data types supported by the file methods depend on the codec used.

   *errors* may be given to define the error handling. It defaults to ``'strict'``,
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function.
   It defaults to -1, which means that the default buffer size will be used.

   .. versionchanged:: 3.11
      The ``'U'`` mode has been removed.


.. function:: EncodedFile(file, data_encoding, file_encoding=None, errors='strict')

   Return a :class:`StreamRecoder` instance, a wrapped version of *file*
   which provides transparent transcoding. The original file is closed
   when the wrapped version is closed.

   Data written to the wrapped file is decoded according to the given
   *data_encoding* and then written to the original file as bytes using
   *file_encoding*. Bytes read from the original file are decoded
   according to *file_encoding*, and the result is encoded
   using *data_encoding*.

   If *file_encoding* is not given, it defaults to *data_encoding*.

   *errors* may be given to define the error handling. It defaults to
   ``'strict'``, which causes :exc:`ValueError` to be raised in case an encoding
   error occurs.
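
The transcoding direction of :func:`EncodedFile` is easy to get backwards, so
a small sketch may help. It assumes an in-memory :class:`io.BytesIO` object
standing in for the original file, with UTF-8 on the caller's side and
Latin-1 in the file.

.. code-block:: python

   import codecs
   import io

   raw = io.BytesIO()                      # stands in for the original file
   wrapped = codecs.EncodedFile(raw, data_encoding='utf-8',
                                file_encoding='latin-1')

   # The caller writes UTF-8 bytes; the file receives Latin-1 bytes.
   wrapped.write('café'.encode('utf-8'))
   stored = raw.getvalue()

   # Reading goes the other way: Latin-1 in the file, UTF-8 for the caller.
   raw.seek(0)
   read_back = wrapped.read()

The wrapper is deliberately left open in this sketch, because closing it would
also close ``raw`` and make its contents unavailable.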


.. function:: iterencode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental encoder to iteratively encode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental encoder.

   This function requires that the codec accept text :class:`str` objects
   to encode. Therefore it does not support bytes-to-bytes encoders such as
   ``base64_codec``.


.. function:: iterdecode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental decoder to iteratively decode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental decoder.

   This function requires that the codec accept :class:`bytes` objects
   to decode. Therefore it does not support text-to-text encoders such as
   ``rot_13``, although ``rot_13`` may be used equivalently with
   :func:`iterencode`.
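
The following sketch illustrates both generators. The chunk boundaries in
``chunks`` are chosen on purpose so that they fall inside multi-byte UTF-8
sequences, which is the situation an incremental decoder is meant to handle.

.. code-block:: python

   import codecs

   # Each input string is encoded as it arrives.
   pieces = ['Gr', 'üß', 'e']
   encoded = list(codecs.iterencode(pieces, 'utf-8'))

   # The decoder buffers incomplete sequences until the next chunk arrives.
   chunks = [b'Gr\xc3', b'\xbc\xc3', b'\x9fe']
   decoded = ''.join(codecs.iterdecode(chunks, 'utf-8'))

Decoding each chunk separately with ``bytes.decode`` would fail here, because
no single chunk after the split is valid UTF-8 on its own.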


The module also provides the following constants which are useful for reading
and writing to platform-dependent files:


.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various byte sequences,
   being Unicode byte order marks (BOMs) for several encodings. They are
   used in UTF-16 and UTF-32 data streams to indicate the byte order used,
   and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.
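
One possible use of these constants is a small BOM sniffer. The helper
``sniff_bom`` below is hypothetical and not part of the module. It is a
heuristic only: UTF-16-LE data that starts with a BOM followed by a NUL
character is indistinguishable from data starting with the UTF-32-LE BOM.

.. code-block:: python

   import codecs
   import sys

   # Longest marks first: BOM_UTF32_LE begins with the bytes of BOM_UTF16_LE.
   _BOMS = [
       (codecs.BOM_UTF32_LE, 'utf-32-le'),
       (codecs.BOM_UTF32_BE, 'utf-32-be'),
       (codecs.BOM_UTF8, 'utf-8'),
       (codecs.BOM_UTF16_LE, 'utf-16-le'),
       (codecs.BOM_UTF16_BE, 'utf-16-be'),
   ]

   def sniff_bom(data):
       """Return (encoding, payload) for a leading BOM, or (None, data)."""
       for bom, encoding in _BOMS:
           if data.startswith(bom):
               return encoding, data[len(bom):]
       return None, data

   encoding, payload = sniff_bom(codecs.BOM_UTF8 + 'hi'.encode('utf-8'))

   # BOM_UTF16 follows the native byte order of the platform.
   if sys.byteorder == 'little':
       native = codecs.BOM_UTF16_LE
   else:
       native = codecs.BOM_UTF16_BE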


.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interfaces for working with codec objects, and can also be used as the basis
for custom codec implementations.

Each codec has to define four interfaces to make it usable as a codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream readers and writers typically reuse the stateless encoder/decoder to
implement the file protocols. Codec authors also need to define how the
codec will handle encoding and decoding errors.


.. _surrogateescape:
.. _error-handlers:

Error Handlers
^^^^^^^^^^^^^^

To simplify and standardize error handling, codecs may implement different
error handling schemes by accepting the *errors* string argument:

      >>> 'German ß, ♬'.encode(encoding='ascii', errors='backslashreplace')
      b'German \\xdf, \\u266c'
      >>> 'German ß, ♬'.encode(encoding='ascii', errors='xmlcharrefreplace')
      b'German &#223;, &#9836;'

.. index::
   pair: strict; error handler's name
   pair: ignore; error handler's name
   pair: replace; error handler's name
   pair: backslashreplace; error handler's name
   pair: surrogateescape; error handler's name
   single: ? (question mark); replacement character
   single: \ (backslash); escape sequence
   single: \x; escape sequence
   single: \u; escape sequence
   single: \U; escape sequence

The following error handlers can be used with all Python
:ref:`standard-encodings` codecs:

.. tabularcolumns:: |l|L|

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
|                         | this is the default. Implemented in           |
|                         | :func:`strict_errors`.                        |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the malformed data and continue without|
|                         | further notice. Implemented in                |
|                         | :func:`ignore_errors`.                        |
+-------------------------+-----------------------------------------------+
| ``'replace'``           | Replace with a replacement marker. On         |
|                         | encoding, use ``?`` (ASCII character). On     |
|                         | decoding, use ``�`` (U+FFFD, the official     |
|                         | REPLACEMENT CHARACTER). Implemented in        |
|                         | :func:`replace_errors`.                       |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences.    |
|                         | On encoding, use hexadecimal form of Unicode  |
|                         | code point with formats ``\xhh`` ``\uxxxx``   |
|                         | ``\Uxxxxxxxx``. On decoding, use hexadecimal  |
|                         | form of byte value with format ``\xhh``.      |
|                         | Implemented in                                |
|                         | :func:`backslashreplace_errors`.              |
+-------------------------+-----------------------------------------------+
| ``'surrogateescape'``   | On decoding, replace byte with individual     |
|                         | surrogate code ranging from ``U+DC80`` to     |
|                         | ``U+DCFF``. This code will then be turned     |
|                         | back into the same byte when the              |
|                         | ``'surrogateescape'`` error handler is used   |
|                         | when encoding the data. (See :pep:`383` for   |
|                         | more.)                                        |
+-------------------------+-----------------------------------------------+

.. index::
   pair: xmlcharrefreplace; error handler's name
   pair: namereplace; error handler's name
   single: \N; escape sequence

The following error handlers are only applicable to encoding (within
:term:`text encodings <text encoding>`):

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'xmlcharrefreplace'`` | Replace with XML/HTML numeric character       |
|                         | reference, which is a decimal form of Unicode |
|                         | code point with format ``&#num;``.            |
|                         | Implemented in                                |
|                         | :func:`xmlcharrefreplace_errors`.             |
+-------------------------+-----------------------------------------------+
| ``'namereplace'``       | Replace with ``\N{...}`` escape sequences;    |
|                         | what appears in the braces is the Name        |
|                         | property from Unicode Character Database.     |
|                         | Implemented in :func:`namereplace_errors`.    |
+-------------------------+-----------------------------------------------+

.. index::
   pair: surrogatepass; error handler's name

In addition, the following error handler is specific to the given codecs:

+-------------------+------------------------+-------------------------------------------+
| Value             | Codecs                 | Meaning                                   |
+===================+========================+===========================================+
|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding of surrogate  |
|                   | utf-16-be, utf-16-le,  | code points (``U+D800`` - ``U+DFFF``) as  |
|                   | utf-32-be, utf-32-le   | normal code points. Otherwise these       |
|                   |                        | codecs treat the presence of a surrogate  |
|                   |                        | code point in :class:`str` as an error.   |
+-------------------+------------------------+-------------------------------------------+

.. versionadded:: 3.1
   The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers.

.. versionchanged:: 3.4
   The ``'surrogatepass'`` error handler now works with utf-16\* and utf-32\*
   codecs.

.. versionadded:: 3.5
   The ``'namereplace'`` error handler.

.. versionchanged:: 3.5
   The ``'backslashreplace'`` error handler now works with decoding and
   translating.
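
As a rough illustration of the tables above, the sketch below applies several
handlers to one unencodable character and to one byte string that is not
valid UTF-8. The sample data is arbitrary.

.. code-block:: python

   text = 'ß'                      # not representable in ASCII
   ignored = text.encode('ascii', errors='ignore')
   replaced = text.encode('ascii', errors='replace')
   named = text.encode('ascii', errors='namereplace')

   bad = b'caf\xe9'                # Latin-1 bytes, invalid as UTF-8
   marked = bad.decode('utf-8', errors='replace')
   escaped = bad.decode('utf-8', errors='backslashreplace')

   # 'surrogateescape' smuggles the offending byte through str and back.
   smuggled = bad.decode('utf-8', errors='surrogateescape')
   restored = smuggled.encode('utf-8', errors='surrogateescape')

   # 'surrogatepass' lets a lone surrogate be encoded like any code point.
   lone = '\ud800'.encode('utf-8', errors='surrogatepass')

The round trip through ``'surrogateescape'`` only restores the original bytes
when the same handler is used in both directions.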

The set of allowed values can be extended by registering a new named error
handler:

.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   The *error_handler* argument will be called during encoding and decoding
   in case of an error, when *name* is specified as the *errors* parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The
   error handler must either raise this or a different exception, or return a
   tuple with a replacement for the unencodable part of the input and a position
   where encoding should continue. The replacement may be either :class:`str` or
   :class:`bytes`. If the replacement is bytes, the encoder will simply copy
   them into the output buffer. If the replacement is a string, the encoder will
   encode the replacement. Encoding continues on the original input at the
   specified position. Negative position values will be treated as being
   relative to the end of the input string. If the resulting position is out of
   bounds, an :exc:`IndexError` will be raised.

   Decoding and translating work similarly, except that :exc:`UnicodeDecodeError`
   or :exc:`UnicodeTranslateError` is passed to the handler and the
   replacement from the error handler is put into the output directly.
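
A custom handler following the contract above might look like the sketch
below. The function ``underscore_errors`` and the registered name
``'example.underscore'`` are hypothetical; the handler only deals with
encoding errors and re-raises anything else.

.. code-block:: python

   import codecs

   def underscore_errors(exc):
       # Replace each unencodable character with one underscore and resume
       # encoding right after the failing range.
       if isinstance(exc, UnicodeEncodeError):
           return '_' * (exc.end - exc.start), exc.end
       raise exc   # decoding and translating errors are not handled here

   codecs.register_error('example.underscore', underscore_errors)

   result = 'German ß, ♬'.encode('ascii', errors='example.underscore')

Because the replacement is a :class:`str`, the encoder encodes it with the
target codec, so it must itself be encodable (here, plain ASCII).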
4397db96d56Sopenharmony_ci
4407db96d56Sopenharmony_ciPreviously registered error handlers (including the standard error handlers)
4417db96d56Sopenharmony_cican be looked up by name:
4427db96d56Sopenharmony_ci
4437db96d56Sopenharmony_ci.. function:: lookup_error(name)
4447db96d56Sopenharmony_ci
4457db96d56Sopenharmony_ci   Return the error handler previously registered under the name *name*.
4467db96d56Sopenharmony_ci
4477db96d56Sopenharmony_ci   Raises a :exc:`LookupError` in case the handler cannot be found.
4487db96d56Sopenharmony_ci
4497db96d56Sopenharmony_ciThe following standard error handlers are also made available as module level
4507db96d56Sopenharmony_cifunctions:
4517db96d56Sopenharmony_ci
.. function:: strict_errors(exception)

   Implements the ``'strict'`` error handling.

   Each encoding or decoding error raises a :exc:`UnicodeError`.


.. function:: ignore_errors(exception)

   Implements the ``'ignore'`` error handling.

   Malformed data is ignored; encoding or decoding is continued without
   further notice.


.. function:: replace_errors(exception)

   Implements the ``'replace'`` error handling.

   Substitutes ``?`` (ASCII character) for encoding errors or ``�`` (U+FFFD,
   the official REPLACEMENT CHARACTER) for decoding errors.


.. function:: backslashreplace_errors(exception)

   Implements the ``'backslashreplace'`` error handling.

   Malformed data is replaced by a backslashed escape sequence.
   On encoding, use the hexadecimal form of the Unicode code point with formats
   ``\xhh``, ``\uxxxx``, or ``\Uxxxxxxxx``. On decoding, use the hexadecimal
   form of the byte value with format ``\xhh``.

   .. versionchanged:: 3.5
      Works with decoding and translating.


.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``'xmlcharrefreplace'`` error handling (for encoding within
   :term:`text encoding` only).

   The unencodable character is replaced by an appropriate XML/HTML numeric
   character reference, which is the decimal form of the Unicode code point
   with format ``&#num;``.


.. function:: namereplace_errors(exception)

   Implements the ``'namereplace'`` error handling (for encoding within
   :term:`text encoding` only).

   The unencodable character is replaced by a ``\N{...}`` escape sequence. The
   set of characters that appear in the braces is the Name property from the
   Unicode Character Database. For example, the German lowercase letter ``'ß'``
   will be converted to the byte sequence ``\N{LATIN SMALL LETTER SHARP S}``.

   .. versionadded:: 3.5


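As a rough illustration of the handlers described above, the following sketch
encodes a sample string to ASCII with each handler name via :meth:`str.encode`
and looks a handler up with :func:`lookup_error`. The sample text, the variable
names, and the handler name ``"no-such-handler"`` are chosen for illustration
only.

```python
import codecs

text = "aß"  # sample input; 'ß' cannot be encoded to ASCII

# Apply each standard handler by name when the encoder reaches 'ß'.
results = {
    name: text.encode("ascii", errors=name)
    for name in ("ignore", "replace", "backslashreplace",
                 "xmlcharrefreplace", "namereplace")
}
# results["ignore"]            == b"a"
# results["replace"]           == b"a?"
# results["backslashreplace"]  == b"a\\xdf"
# results["xmlcharrefreplace"] == b"a&#223;"
# results["namereplace"]       == b"a\\N{LATIN SMALL LETTER SHARP S}"

# On decoding, 'replace' substitutes U+FFFD for the undecodable byte.
decoded = b"a\xff".decode("ascii", errors="replace")  # 'a\ufffd'

# Handlers are ordinary callables registered under their names.
handler = codecs.lookup_error("replace")

# A name that was never registered raises LookupError.
try:
    codecs.lookup_error("no-such-handler")
except LookupError:
    unknown_raises = True
else:
    unknown_raises = False
```
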
.. _codec-objects:

Stateless Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The base :class:`Codec` class defines these methods, which also define the
function interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input, errors='strict')

   Encodes the object *input* and returns a tuple (output object, length consumed).
   For instance, a :term:`text encoding` converts
   a string object to a bytes object using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method must not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero-length input and return an empty object
   of the output object type in this situation.


.. method:: Codec.decode(input, errors='strict')

   Decodes the object *input* and returns a tuple (output object, length
   consumed). For instance, for a :term:`text encoding`, decoding converts
   a bytes object encoded using a particular
   character set encoding to a string object.

   For text encodings and bytes-to-bytes codecs,
   *input* must be a bytes object or one which provides the read-only
   buffer interface, for example, buffer objects and memory mapped files.

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method must not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero-length input and return an empty object
   of the output object type in this situation.


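The stateless interface can be tried out through the ``encode`` and ``decode``
callables of the object returned by :func:`lookup`. The sketch below is only an
illustration using the UTF-8 codec; the sample string and variable names are
arbitrary.

```python
import codecs

info = codecs.lookup("utf-8")  # exposes the stateless encode/decode functions

# Encoding returns (output object, number of input characters consumed).
encoded, chars_consumed = info.encode("héllo")
# encoded == b"h\xc3\xa9llo", chars_consumed == 5

# Decoding returns (output object, number of input bytes consumed).
decoded, bytes_consumed = info.decode(encoded)
# decoded == "héllo", bytes_consumed == 6

# Zero-length input yields an empty object of the output type.
empty_bytes, _ = info.encode("")
empty_text, _ = info.decode(b"")
```
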
Incremental Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
with multiple calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
the incremental encoder/decoder. The incremental encoder/decoder keeps track of
the encoding/decoding process between method calls.

The joined output of calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
the same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.


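The equivalence between joined incremental output and a single stateless call
can be checked with a small experiment. This is a sketch only: the sample data
and the chunk boundaries are arbitrary, with the first boundary deliberately
placed inside a multi-byte character.

```python
import codecs

data = "€uro".encode("utf-8")               # b'\xe2\x82\xacuro'
chunks = [data[:2], data[2:4], data[4:]]    # first split falls inside '€'

decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(chunk) for chunk in chunks[:-1]]
pieces.append(decoder.decode(chunks[-1], final=True))  # last call: final=True

incremental = "".join(pieces)
stateless, _ = codecs.lookup("utf-8").decode(data)
# pieces[0] is empty because the first chunk ends mid-character;
# incremental and stateless are both "€uro".
```
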
.. _incremental-encoder-objects:

IncrementalEncoder Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`IncrementalEncoder` class is used for encoding an input in multiple
steps. It defines the following methods, which every incremental encoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalEncoder(errors='strict')

   Constructor for an :class:`IncrementalEncoder` instance.

   All incremental encoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalEncoder` may implement different error handling schemes
   by providing the *errors* keyword argument. See :ref:`error-handlers` for
   possible values.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalEncoder`
   object.


   .. method:: encode(object, final=False)

      Encodes *object* (taking the current state of the encoder into account)
      and returns the resulting encoded object. If this is the last call to
      :meth:`encode`, *final* must be true (the default is false).


   .. method:: reset()

      Reset the encoder to the initial state. The output is discarded: call
      ``.encode(object, final=True)``, passing an empty byte or text string
      if necessary, to reset the encoder and to get the output.


   .. method:: getstate()

      Return the current state of the encoder, which must be an integer. The
      implementation should make sure that ``0`` is the most common
      state. (States that are more complicated than integers can be converted
      into an integer by marshaling/pickling the state and encoding the bytes
      of the resulting string into an integer.)


   .. method:: setstate(state)

      Set the state of the encoder to *state*. *state* must be an encoder state
      returned by :meth:`getstate`.


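A brief sketch of how these methods interact, using the incremental encoder of
the ``utf-16`` codec as an example of an encoder that carries state (whether a
byte order mark has already been written). The variable names are illustrative
only.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

# The first call emits the byte order mark; later calls do not repeat it.
first = encoder.encode("a")
second = encoder.encode("b", final=True)
joined = first + second          # same bytes as "ab".encode("utf-16")

# getstate() returns an integer that setstate() accepts later.
state = encoder.getstate()
encoder.setstate(state)

# After reset() the encoder starts over and writes a new BOM.
encoder.reset()
restarted = encoder.encode("a", final=True)
```
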
.. _incremental-decoder-objects:

IncrementalDecoder Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`IncrementalDecoder` class is used for decoding an input in multiple
steps. It defines the following methods, which every incremental decoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalDecoder(errors='strict')

   Constructor for an :class:`IncrementalDecoder` instance.

   All incremental decoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalDecoder` may implement different error handling schemes
   by providing the *errors* keyword argument. See :ref:`error-handlers` for
   possible values.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalDecoder`
   object.


   .. method:: decode(object, final=False)

      Decodes *object* (taking the current state of the decoder into account)
      and returns the resulting decoded object. If this is the last call to
      :meth:`decode`, *final* must be true (the default is false). If *final* is
      true, the decoder must decode the input completely and must flush all
      buffers. If this isn't possible (e.g. because of incomplete byte sequences
      at the end of the input), it must initiate error handling just like in the
      stateless case (which might raise an exception).


   .. method:: reset()

      Reset the decoder to the initial state.


   .. method:: getstate()

      Return the current state of the decoder. This must be a tuple with two
      items: the first must be the buffer containing the still undecoded
      input; the second must be an integer and can carry additional state
      info. (The implementation should make sure that ``0`` is the most common
      additional state info.) If this additional state info is ``0``, it must be
      possible to set the decoder to the state which has no input buffered and
      ``0`` as the additional state info, so that feeding the previously
      buffered input to the decoder returns it to the previous state without
      producing any output. (Additional state info that is more complicated than
      integers can be converted into an integer by marshaling/pickling the info
      and encoding the bytes of the resulting string into an integer.)


   .. method:: setstate(state)

      Set the state of the decoder to *state*. *state* must be a decoder state
      returned by :meth:`getstate`.


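The state handling described above can be observed with the UTF-8 incremental
decoder. The following is a sketch under the assumption that the decoder
buffers an incomplete trailing sequence, as the standard UTF-8 decoder does;
the variable names are illustrative.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# Feed only the first two bytes of the three-byte sequence for '€'.
partial = decoder.decode(b"\xe2\x82")      # nothing can be decoded yet
buffered, extra = decoder.getstate()       # undecoded input and an integer

# Restore the saved state on a fresh decoder and finish decoding there.
other = codecs.getincrementaldecoder("utf-8")()
other.setstate((buffered, extra))
completed = other.decode(b"\xac", final=True)

# With final=True, incomplete trailing input triggers error handling,
# which raises under the default 'strict' scheme.
strict = codecs.getincrementaldecoder("utf-8")()
try:
    strict.decode(b"\xe2\x82", final=True)
except UnicodeDecodeError:
    incomplete_raises = True
else:
    incomplete_raises = False
```
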
Stream Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^


The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
working interfaces which can be used to implement new encoding submodules very
easily. See :mod:`encodings.utf_8` for an example of how this is done.


.. _stream-writer-objects:

StreamWriter Objects
~~~~~~~~~~~~~~~~~~~~

The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
following methods, which every stream writer must define in order to be
compatible with the Python codec registry.


.. class:: StreamWriter(stream, errors='strict')

   Constructor for a :class:`StreamWriter` instance.

   All stream writers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   The *stream* argument must be a file-like object open for writing
   text or binary data, as appropriate for the specific codec.

   The :class:`StreamWriter` may implement different error handling schemes by
   providing the *errors* keyword argument. See :ref:`error-handlers` for
   the standard error handlers the underlying stream codec may support.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamWriter` object.

   .. method:: write(object)

      Writes the object's contents encoded to the stream.


   .. method:: writelines(list)

      Writes the concatenated iterable of strings to the stream (possibly by reusing
      the :meth:`write` method). Infinite or
      very large iterables are not supported. The standard bytes-to-bytes codecs
      do not support this method.


   .. method:: reset()

      Resets the codec buffers used for keeping internal state.

      Calling this method should ensure that the data on the output is put into
      a clean state that allows appending of new fresh data without having to
      rescan the whole stream to recover state.


In addition to the above methods, the :class:`StreamWriter` must also inherit
all other methods and attributes from the underlying stream.


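A minimal usage sketch for a stream writer, assuming an in-memory
:class:`io.BytesIO` object as the underlying binary stream and the UTF-8 codec;
the names ``raw`` and ``writer`` are illustrative.

```python
import codecs
import io

raw = io.BytesIO()                       # binary stream receiving encoded data
writer = codecs.getwriter("utf-8")(raw)

writer.write("héllo\n")
writer.writelines(["a", "b", "\n"])

# Attributes not defined by the writer are looked up on the wrapped stream.
position = writer.tell()
written = raw.getvalue()                 # b'h\xc3\xa9llo\nab\n'
```
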
.. _stream-reader-objects:

StreamReader Objects
~~~~~~~~~~~~~~~~~~~~

The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
following methods, which every stream reader must define in order to be
compatible with the Python codec registry.


.. class:: StreamReader(stream, errors='strict')

   Constructor for a :class:`StreamReader` instance.

   All stream readers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   The *stream* argument must be a file-like object open for reading
   text or binary data, as appropriate for the specific codec.

   The :class:`StreamReader` may implement different error handling schemes by
   providing the *errors* keyword argument. See :ref:`error-handlers` for
   the standard error handlers the underlying stream codec may support.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamReader` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: read(size=-1, chars=-1, firstline=False)

      Decodes data from the stream and returns the resulting object.

      The *chars* argument indicates the number of decoded
      code points or bytes to return. The :meth:`read` method will
      never return more data than requested, but it might return less
      if there is not enough available.

      The *size* argument indicates the approximate maximum
      number of encoded bytes or code points to read
      for decoding. The decoder can modify this setting as
      appropriate. The default value -1 indicates to read and decode as much as
      possible. This parameter is intended to
      prevent having to decode huge files in one step.

      The *firstline* flag indicates that
      it would be sufficient to return only the first
      line if there are decoding errors on later lines.

      The method should use a greedy read strategy, meaning that it should read
      as much data as is allowed within the definition of the encoding and the
      given size, e.g. if optional encoding endings or state markers are
      available on the stream, these should be read too.


   .. method:: readline(size=None, keepends=True)

      Read one line from the input stream and return the decoded data.

      *size*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.

      If *keepends* is false, line-endings will be stripped from the lines
      returned.


   .. method:: readlines(sizehint=None, keepends=True)

      Read all lines available on the input stream and return them as a list of
      lines.

      Line-endings are implemented using the codec's :meth:`decode` method and
      are included in the list entries if *keepends* is true.

      *sizehint*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.


   .. method:: reset()

      Resets the codec buffers used for keeping internal state.

      Note that no stream repositioning should take place. This method is
      primarily intended to be able to recover from decoding errors.


In addition to the above methods, the :class:`StreamReader` must also inherit
all other methods and attributes from the underlying stream.

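A minimal usage sketch for a stream reader, assuming an in-memory
:class:`io.BytesIO` object holding UTF-8 data as the underlying stream; the
sample text and variable names are illustrative.

```python
import codecs
import io

raw = io.BytesIO("first\nsecond\nthird\n".encode("utf-8"))
reader = codecs.getreader("utf-8")(raw)

first = reader.readline()                  # keeps the line ending
second = reader.readline(keepends=False)   # line ending stripped
remaining = reader.readlines()             # list of the lines still unread
at_end = reader.read()                     # empty string at end of input
```
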
.. _stream-reader-writer:

StreamReaderWriter Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`StreamReaderWriter` is a convenience class that allows wrapping
streams which work in both read and write modes.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamReaderWriter(stream, Reader, Writer, errors='strict')

   Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
   object. *Reader* and *Writer* must be factory functions or classes providing the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively. Error
   handling is done in the same way as defined for the stream readers and writers.

:class:`StreamReaderWriter` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.


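As a sketch of the construction described above, the reader and writer
factories can be taken from the object returned by :func:`lookup`. The example
assumes an in-memory :class:`io.BytesIO` stream; the variable names are
illustrative.

```python
import codecs
import io

info = codecs.lookup("utf-8")
raw = io.BytesIO()   # in-memory binary stream usable for reading and writing

stream = codecs.StreamReaderWriter(raw, info.streamreader, info.streamwriter)
stream.write("naïve\n")
stream.seek(0)                 # rewind before reading the data back
line = stream.readline()
encoded = raw.getvalue()       # UTF-8 bytes held by the underlying stream
```
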
.. _stream-recoder-objects:

StreamRecoder Objects
~~~~~~~~~~~~~~~~~~~~~

The :class:`StreamRecoder` translates data from one encoding to another,
which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors='strict')

   Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
   *encode* and *decode* work on the frontend (the data visible to
   code calling :meth:`read` and :meth:`write`), while *Reader* and *Writer*
   work on the backend (the data in *stream*).

   You can use these objects to do transparent transcodings, e.g., from Latin-1
   to UTF-8 and back.

   The *stream* argument must be a file-like object.

   The *encode* and *decode* arguments must
   adhere to the :class:`Codec` interface. *Reader* and
   *Writer* must be factory functions or classes providing objects of the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively.

   Error handling is done in the same way as defined for the stream readers and
   writers.


:class:`StreamRecoder` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.


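The frontend/backend split can be sketched as follows, with UTF-8 as the
frontend encoding and Latin-1 as the backend encoding. The helper
``make_recoder`` is hypothetical and exists only to keep the example short; the
streams are in-memory :class:`io.BytesIO` objects.

```python
import codecs
import io

utf8 = codecs.lookup("utf-8")       # frontend encoding seen by callers
latin1 = codecs.lookup("latin-1")   # backend encoding stored in the stream

def make_recoder(stream):
    """Hypothetical helper: UTF-8 frontend over a Latin-1 backend stream."""
    return codecs.StreamRecoder(stream, utf8.encode, utf8.decode,
                                latin1.streamreader, latin1.streamwriter)

# Reading: Latin-1 bytes in the stream come out as UTF-8 bytes.
source = io.BytesIO("café".encode("latin-1"))
frontend_bytes = make_recoder(source).read()

# Writing: UTF-8 bytes passed to write() are stored as Latin-1 bytes.
target = io.BytesIO()
make_recoder(target).write("café".encode("utf-8"))
backend_bytes = target.getvalue()
```
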
.. _encodings-overview:

Encodings and Unicode
---------------------

Strings are stored internally as sequences of code points in the
range ``U+0000``--``U+10FFFF``. (See :pep:`393` for
more details about the implementation.)
Once a string object is used outside of CPU and memory, endianness
and how these arrays are stored as bytes become an issue. As with other
codecs, serialising a string into a sequence of bytes is known as *encoding*,
and recreating the string from the sequence of bytes is known as *decoding*.
There are a variety of different text serialisation codecs, which are
collectively referred to as :term:`text encodings <text encoding>`.

The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
the code points 0--255 to the bytes ``0x0``--``0xff``, which means that a string
object that contains code points above ``U+00FF`` can't be encoded with this
codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
like the following (although the details of the error message may differ):
``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
position 3: ordinal not in range(256)``.

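The one-to-one mapping of ``latin-1`` and its failure mode can be verified with
a short sketch; the sample string is chosen so that the offending character
sits at position 3, matching the error message quoted above.

```python
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in text]   # 0, 1, ..., 255

round_trip = text.encode("latin-1")      # identical to all_bytes

try:
    "abc\u1234".encode("latin-1")        # U+1234 is above U+00FF
except UnicodeEncodeError as exc:
    failing_position = exc.start         # index of the offending character
```
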
There's another group of encodings (the so-called charmap encodings) that choose
a different subset of all Unicode code points and how these code points are
mapped to the bytes ``0x0``--``0xff``. To see how this is done, open, for
example, :file:`encodings/cp1252.py` (which is an encoding that is used
primarily on Windows). There's a string constant with 256 characters that shows
you which character is mapped to which byte value.

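The table can also be inspected from Python. The sketch below assumes the
CPython layout, in which :mod:`encodings.cp1252` exposes the 256-character
string constant under the name ``decoding_table``; the other variable names are
illustrative.

```python
import encodings.cp1252

table = encodings.cp1252.decoding_table   # one character per byte value
table_size = len(table)

# Byte 0x80 maps to EURO SIGN in cp1252 but to U+0080 in latin-1.
euro_from_cp1252 = b"\x80".decode("cp1252")
euro_byte = "\u20ac".encode("cp1252")
c1_from_latin1 = b"\x80".decode("latin-1")
```
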
All of these encodings can only encode 256 of the 1114112 code points
defined in Unicode. A simple and straightforward way to store every Unicode
code point is to store each code point as four consecutive bytes. There are two
possibilities: store the bytes in big endian or in little endian order. These
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian
machine you will always have to swap bytes on encoding and decoding. ``UTF-32``
avoids this problem: bytes will always be in natural endianness. When these
bytes are read by a CPU with a different endianness, however, the bytes have to
be swapped.

To make it possible to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte
sequence, there's the so-called BOM ("Byte Order Mark"). This is the Unicode
character ``U+FEFF``. This character can be prepended to every ``UTF-16`` or
``UTF-32`` byte sequence. The byte-swapped version of this character
(``0xFFFE``) is an illegal character that may not appear in a Unicode text. So
when the first character in a ``UTF-16`` or ``UTF-32`` byte sequence appears to
be a ``U+FFFE``, the bytes have to be swapped on decoding.

Unfortunately the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
it's a device to determine the storage layout of the encoded bytes, and vanishes
once the byte sequence has been decoded into a string; as a ``ZERO WIDTH
NO-BREAK SPACE`` it's a normal character that will be decoded like any other.

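The byte order and BOM behaviour described above can be observed directly. The
following is a minimal sketch using only the built-in codecs and the BOM
constants from :mod:`codecs`; the variable names are illustrative only.

```python
import codecs
import sys

text = "A"  # U+0041

# Explicit byte order: four bytes per code point, and no BOM is written.
big = text.encode("utf-32-be")       # b'\x00\x00\x00A'
little = text.encode("utf-32-le")    # b'A\x00\x00\x00'

# "utf-32" writes a BOM first, then the data in the machine's native order.
native = text.encode("utf-32")
native_bom = (codecs.BOM_UTF32_LE if sys.byteorder == "little"
              else codecs.BOM_UTF32_BE)

# On decoding, "utf-32" reads the BOM to pick the byte order and drops it,
# so both layouts decode to the same string.
from_big = (codecs.BOM_UTF32_BE + big).decode("utf-32")
from_little = (codecs.BOM_UTF32_LE + little).decode("utf-32")

# A codec with an explicit byte order does not interpret the BOM: U+FEFF is
# decoded like any other character and stays in the resulting string.
kept = (codecs.BOM_UTF16_LE + text.encode("utf-16-le")).decode("utf-16-le")
```

Here ``from_big`` and ``from_little`` are both ``"A"``, while ``kept`` is
``"\ufeffA"`` because the ``utf-16-le`` codec leaves the leading ``U+FEFF`` in
place.
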
There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: marker bits (the most significant bits) and payload bits. The marker bits
are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode
characters are encoded like this (with x being payload bits, which when
concatenated give the Unicode character):

+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+

The least significant bit of the Unicode character is the rightmost x bit.

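The bit layout in the table can be reproduced by hand. The following is an
illustrative sketch only: ``utf8_encode_code_point`` is a hypothetical helper
written for this example, not part of the :mod:`codecs` API, and real code
should simply call :meth:`str.encode`.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one code point by filling the payload (x) bits of the table."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# One sample character from each row of the table.
samples = ["A", "\u00e9", "\u20ac", "\U0001f600"]
encoded = [utf8_encode_code_point(ord(ch)) for ch in samples]
```

Each entry of ``encoded`` should match what ``ch.encode("utf-8")`` returns for
the corresponding sample character.
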
As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in
the decoded string (even if it's the first character) is treated as a ``ZERO
WIDTH NO-BREAK SPACE``.

Without external information it's impossible to reliably determine which
encoding was used for encoding a string. Each charmap encoding can
decode any random byte sequence. However that's not possible with UTF-8, as
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
sequences. To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python calls
``"utf-8-sig"``) for its Notepad program: before any of the Unicode characters
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
that any charmap encoded file starts with these byte values (which would e.g.
map to

   | LATIN SMALL LETTER I WITH DIAERESIS
   | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
   | INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
correctly guessed from the byte sequence. So here the BOM is not used to
determine the byte order used for generating the byte sequence, but as a
signature that helps in guessing the encoding. On encoding, the ``utf-8-sig``
codec will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the
file. On decoding, ``utf-8-sig`` will skip those three bytes if they appear as
the first three bytes in the file. In UTF-8, the use of the BOM is discouraged
and should generally be avoided.

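The difference between the ``utf-8`` and ``utf-8-sig`` codecs can be checked
with a short round trip. This is a minimal sketch; the sample text and variable
names are arbitrary.

```python
import codecs

# Encoding with utf-8-sig prepends the three signature bytes EF BB BF.
with_signature = "spam".encode("utf-8-sig")
plain = "spam".encode("utf-8")

# Decoding with utf-8-sig skips the signature if it is present ...
decoded_sig = with_signature.decode("utf-8-sig")
# ... and also accepts data that has no signature at all.
decoded_plain = plain.decode("utf-8-sig")

# The plain utf-8 codec does not strip anything: the signature shows up as a
# leading U+FEFF (ZERO WIDTH NO-BREAK SPACE) in the decoded string.
decoded_as_utf8 = with_signature.decode("utf-8")
```

Here ``with_signature`` equals ``codecs.BOM_UTF8 + plain``, both ``decoded_sig``
and ``decoded_plain`` are ``"spam"``, and ``decoded_as_utf8`` is
``"\ufeffspam"``.
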

.. _standard-encodings:

Standard Encodings
------------------

Python comes with a number of codecs built-in, either implemented as C functions
or with dictionaries as mapping tables. The following table lists the codecs by
name, together with a few common aliases, and the languages for which the
encoding is likely used. Neither the list of aliases nor the list of languages
is meant to be exhaustive. Note that spelling alternatives that differ only in
case or use a hyphen instead of an underscore are also valid aliases; for
example, ``'utf-8'`` is a valid alias for the ``'utf_8'`` codec.

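Alias resolution can be checked with :func:`codecs.lookup`, which returns a
:class:`~codecs.CodecInfo` object whose ``name`` attribute identifies the codec
that was found. The sketch below only uses spellings listed in the table; the
helper name ``resolved_names`` is made up for this example.

```python
import codecs


def resolved_names(spellings):
    """Return the set of codec names that the given spellings resolve to."""
    return {codecs.lookup(spelling).name for spelling in spellings}


# Spellings that differ in case, hyphen/underscore, or use a listed alias.
utf8_names = resolved_names(["utf_8", "utf-8", "UTF-8", "utf8", "U8"])
latin1_names = resolved_names(["latin_1", "latin-1", "iso-8859-1", "L1", "cp819"])

# An unknown spelling raises LookupError instead of returning a codec.
try:
    codecs.lookup("no-such-codec")
    unknown_found = True
except LookupError:
    unknown_found = False
```

Each of the two sets should contain exactly one name, since all spellings in a
group resolve to the same codec.
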
.. impl-detail::

   Some common encodings can bypass the codecs lookup machinery to
   improve performance. These optimization opportunities are only
   recognized by CPython for a limited set of (case insensitive)
   aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs
   (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and
   the same using underscores instead of dashes. Using alternative
   aliases for these encodings may result in slower execution.

   .. versionchanged:: 3.6
      Optimization opportunity recognized for us-ascii.

Many of the character sets support the same languages. They vary in individual
characters (e.g. whether the EURO SIGN is supported or not), and in the
assignment of characters to code positions. For the European languages in
particular, the following variants typically exist:

* an ISO 8859 codeset

* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
  but replaces control characters with additional graphic characters

* an IBM EBCDIC code page

* an IBM PC code page, which is ASCII compatible

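The differences between these variants can be seen by encoding the same
characters with several Western European codecs from the table below. This is
an illustrative sketch; the choice of codecs and characters is arbitrary.

```python
euro = "\u20ac"  # EURO SIGN

# The Windows code page and ISO 8859-15 both support the euro sign, but
# assign it to different byte values.
euro_cp1252 = euro.encode("cp1252")
euro_iso8859_15 = euro.encode("iso8859_15")

# ISO 8859-1 (latin_1) has no euro sign at all.
try:
    euro.encode("latin_1")
    latin_1_has_euro = True
except UnicodeEncodeError:
    latin_1_has_euro = False

# Byte 0x80 is a control character in ISO 8859-1, whereas the Windows code
# page reuses that position for a graphic character.
byte_80_latin_1 = b"\x80".decode("latin_1")
byte_80_cp1252 = b"\x80".decode("cp1252")

# The IBM PC code page is ASCII compatible; the EBCDIC code page is not.
letter_a_cp437 = "A".encode("cp437")
letter_a_cp037 = "A".encode("cp037")
```

With these codecs the euro sign is ``b'\x80'`` in ``cp1252`` and ``b'\xa4'`` in
``iso8859_15``, and the letter ``A`` is ``b'A'`` in ``cp437`` but ``b'\xc1'`` in
``cp037``.
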
10667db96d56Sopenharmony_ci.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|
10677db96d56Sopenharmony_ci
10687db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
10697db96d56Sopenharmony_ci| Codec           | Aliases                        | Languages                      |
10707db96d56Sopenharmony_ci+=================+================================+================================+
10717db96d56Sopenharmony_ci| ascii           | 646, us-ascii                  | English                        |
10727db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
10737db96d56Sopenharmony_ci| big5            | big5-tw, csbig5                | Traditional Chinese            |
10747db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
10757db96d56Sopenharmony_ci| big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
10767db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
10777db96d56Sopenharmony_ci| cp037           | IBM037, IBM039                 | English                        |
10787db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
10797db96d56Sopenharmony_ci| cp273           | 273, IBM273, csIBM273          | German                         |
10807db96d56Sopenharmony_ci|                 |                                |                                |
10817db96d56Sopenharmony_ci|                 |                                | .. versionadded:: 3.4          |
10827db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
10837db96d56Sopenharmony_ci| cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
10847db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
10857db96d56Sopenharmony_ci| cp437           | 437, IBM437                    | English                        |
10867db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
10877db96d56Sopenharmony_ci| cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
10887db96d56Sopenharmony_ci|                 | IBM500                         |                                |
10897db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
10907db96d56Sopenharmony_ci| cp720           |                                | Arabic                         |
10917db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
10927db96d56Sopenharmony_ci| cp737           |                                | Greek                          |
10937db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
10947db96d56Sopenharmony_ci| cp775           | IBM775                         | Baltic languages               |
10957db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
10967db96d56Sopenharmony_ci| cp850           | 850, IBM850                    | Western Europe                 |
10977db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
10987db96d56Sopenharmony_ci| cp852           | 852, IBM852                    | Central and Eastern Europe     |
10997db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11007db96d56Sopenharmony_ci| cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
11017db96d56Sopenharmony_ci|                 |                                | Macedonian, Russian, Serbian   |
11027db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11037db96d56Sopenharmony_ci| cp856           |                                | Hebrew                         |
11047db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11057db96d56Sopenharmony_ci| cp857           | 857, IBM857                    | Turkish                        |
11067db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11077db96d56Sopenharmony_ci| cp858           | 858, IBM858                    | Western Europe                 |
11087db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11097db96d56Sopenharmony_ci| cp860           | 860, IBM860                    | Portuguese                     |
11107db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11117db96d56Sopenharmony_ci| cp861           | 861, CP-IS, IBM861             | Icelandic                      |
11127db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11137db96d56Sopenharmony_ci| cp862           | 862, IBM862                    | Hebrew                         |
11147db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11157db96d56Sopenharmony_ci| cp863           | 863, IBM863                    | Canadian                       |
11167db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11177db96d56Sopenharmony_ci| cp864           | IBM864                         | Arabic                         |
11187db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11197db96d56Sopenharmony_ci| cp865           | 865, IBM865                    | Danish, Norwegian              |
11207db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11217db96d56Sopenharmony_ci| cp866           | 866, IBM866                    | Russian                        |
11227db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11237db96d56Sopenharmony_ci| cp869           | 869, CP-GR, IBM869             | Greek                          |
11247db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11257db96d56Sopenharmony_ci| cp874           |                                | Thai                           |
11267db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11277db96d56Sopenharmony_ci| cp875           |                                | Greek                          |
11287db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11297db96d56Sopenharmony_ci| cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
11307db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11317db96d56Sopenharmony_ci| cp949           | 949, ms949, uhc                | Korean                         |
11327db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11337db96d56Sopenharmony_ci| cp950           | 950, ms950                     | Traditional Chinese            |
11347db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11357db96d56Sopenharmony_ci| cp1006          |                                | Urdu                           |
11367db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11377db96d56Sopenharmony_ci| cp1026          | ibm1026                        | Turkish                        |
11387db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11397db96d56Sopenharmony_ci| cp1125          | 1125, ibm1125, cp866u, ruscii  | Ukrainian                      |
11407db96d56Sopenharmony_ci|                 |                                |                                |
11417db96d56Sopenharmony_ci|                 |                                | .. versionadded:: 3.4          |
11427db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11437db96d56Sopenharmony_ci| cp1140          | ibm1140                        | Western Europe                 |
11447db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11457db96d56Sopenharmony_ci| cp1250          | windows-1250                   | Central and Eastern Europe     |
11467db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11477db96d56Sopenharmony_ci| cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
11487db96d56Sopenharmony_ci|                 |                                | Macedonian, Russian, Serbian   |
11497db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11507db96d56Sopenharmony_ci| cp1252          | windows-1252                   | Western Europe                 |
11517db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11527db96d56Sopenharmony_ci| cp1253          | windows-1253                   | Greek                          |
11537db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11547db96d56Sopenharmony_ci| cp1254          | windows-1254                   | Turkish                        |
11557db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11567db96d56Sopenharmony_ci| cp1255          | windows-1255                   | Hebrew                         |
11577db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11587db96d56Sopenharmony_ci| cp1256          | windows-1256                   | Arabic                         |
11597db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11607db96d56Sopenharmony_ci| cp1257          | windows-1257                   | Baltic languages               |
11617db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11627db96d56Sopenharmony_ci| cp1258          | windows-1258                   | Vietnamese                     |
11637db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11647db96d56Sopenharmony_ci| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
11657db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11667db96d56Sopenharmony_ci| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
11677db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11687db96d56Sopenharmony_ci| euc_jisx0213    | eucjisx0213                    | Japanese                       |
11697db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11707db96d56Sopenharmony_ci| euc_kr          | euckr, korean, ksc5601,        | Korean                         |
11717db96d56Sopenharmony_ci|                 | ks_c-5601, ks_c-5601-1987,     |                                |
11727db96d56Sopenharmony_ci|                 | ksx1001, ks_x-1001             |                                |
11737db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11747db96d56Sopenharmony_ci| gb2312          | chinese, csiso58gb231280,      | Simplified Chinese             |
11757db96d56Sopenharmony_ci|                 | euc-cn, euccn, eucgb2312-cn,   |                                |
11767db96d56Sopenharmony_ci|                 | gb2312-1980, gb2312-80,        |                                |
11777db96d56Sopenharmony_ci|                 | iso-ir-58                      |                                |
11787db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11797db96d56Sopenharmony_ci| gbk             | 936, cp936, ms936              | Unified Chinese                |
11807db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11817db96d56Sopenharmony_ci| gb18030         | gb18030-2000                   | Unified Chinese                |
11827db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11837db96d56Sopenharmony_ci| hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
11847db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11857db96d56Sopenharmony_ci| iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
11867db96d56Sopenharmony_ci|                 | iso-2022-jp                    |                                |
11877db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11887db96d56Sopenharmony_ci| iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
11897db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11907db96d56Sopenharmony_ci| iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
11917db96d56Sopenharmony_ci|                 |                                | Chinese, Western Europe, Greek |
11927db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11937db96d56Sopenharmony_ci| iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
11947db96d56Sopenharmony_ci|                 | iso-2022-jp-2004               |                                |
11957db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11967db96d56Sopenharmony_ci| iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
11977db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
11987db96d56Sopenharmony_ci| iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
11997db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12007db96d56Sopenharmony_ci| iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
12017db96d56Sopenharmony_ci|                 | iso-2022-kr                    |                                |
12027db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12037db96d56Sopenharmony_ci| latin_1         | iso-8859-1, iso8859-1, 8859,   | Western Europe                 |
12047db96d56Sopenharmony_ci|                 | cp819, latin, latin1, L1       |                                |
12057db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12067db96d56Sopenharmony_ci| iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
12077db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12087db96d56Sopenharmony_ci| iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
12097db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12107db96d56Sopenharmony_ci| iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
12117db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12127db96d56Sopenharmony_ci| iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
12137db96d56Sopenharmony_ci|                 |                                | Macedonian, Russian, Serbian   |
12147db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12157db96d56Sopenharmony_ci| iso8859_6       | iso-8859-6, arabic             | Arabic                         |
12167db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12177db96d56Sopenharmony_ci| iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
12187db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12197db96d56Sopenharmony_ci| iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
12207db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12217db96d56Sopenharmony_ci| iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
12227db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12237db96d56Sopenharmony_ci| iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
12247db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12257db96d56Sopenharmony_ci| iso8859_11      | iso-8859-11, thai              | Thai languages                 |
12267db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12277db96d56Sopenharmony_ci| iso8859_13      | iso-8859-13, latin7, L7        | Baltic languages               |
12287db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12297db96d56Sopenharmony_ci| iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
12307db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12317db96d56Sopenharmony_ci| iso8859_15      | iso-8859-15, latin9, L9        | Western Europe                 |
12327db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12337db96d56Sopenharmony_ci| iso8859_16      | iso-8859-16, latin10, L10      | South-Eastern Europe           |
12347db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12357db96d56Sopenharmony_ci| johab           | cp1361, ms1361                 | Korean                         |
12367db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12377db96d56Sopenharmony_ci| koi8_r          |                                | Russian                        |
12387db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12397db96d56Sopenharmony_ci| koi8_t          |                                | Tajik                          |
12407db96d56Sopenharmony_ci|                 |                                |                                |
12417db96d56Sopenharmony_ci|                 |                                | .. versionadded:: 3.5          |
12427db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12437db96d56Sopenharmony_ci| koi8_u          |                                | Ukrainian                      |
12447db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12457db96d56Sopenharmony_ci| kz1048          | kz_1048, strk1048_2002, rk1048 | Kazakh                         |
12467db96d56Sopenharmony_ci|                 |                                |                                |
12477db96d56Sopenharmony_ci|                 |                                | .. versionadded:: 3.5          |
12487db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12497db96d56Sopenharmony_ci| mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
12507db96d56Sopenharmony_ci|                 |                                | Macedonian, Russian, Serbian   |
12517db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12527db96d56Sopenharmony_ci| mac_greek       | macgreek                       | Greek                          |
12537db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12547db96d56Sopenharmony_ci| mac_iceland     | maciceland                     | Icelandic                      |
12557db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12567db96d56Sopenharmony_ci| mac_latin2      | maclatin2, maccentraleurope,   | Central and Eastern Europe     |
12577db96d56Sopenharmony_ci|                 | mac_centeuro                   |                                |
12587db96d56Sopenharmony_ci+-----------------+--------------------------------+--------------------------------+
12597db96d56Sopenharmony_ci| mac_roman       | macroman, macintosh            | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| mac_turkish     | macturkish                     | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
|                 | cyrillic-asian                 |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
|                 | s_jis                          |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
|                 | sjis2004                       |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
|                 | s_jisx0213                     |                                |
+-----------------+--------------------------------+--------------------------------+
| utf_32          | U32, utf32                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_be       | UTF-32BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_le       | UTF-32LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16          | U16, utf16                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_be       | UTF-16BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_le       | UTF-16LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_7           | U7, unicode-1-1-utf-7          | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8           | U8, UTF, utf8, cp65001         | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8_sig       |                                | all languages                  |
+-----------------+--------------------------------+--------------------------------+

.. versionchanged:: 3.4
   The utf-16\* and utf-32\* encoders no longer allow surrogate code points
   (``U+D800``--``U+DFFF``) to be encoded.
   The utf-32\* decoders no longer decode
   byte sequences that correspond to surrogate code points.

.. versionchanged:: 3.8
   ``cp65001`` is now an alias to ``utf_8``.
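
The two notes above can be checked with a short sketch. The sample code point
and the use of the ``'surrogatepass'`` error handler are illustrative choices
rather than part of the notes themselves, and the alias lookup assumes
Python 3.8 or later:

.. code-block:: python

   import codecs

   lone = "\ud800"  # an unpaired high surrogate, chosen arbitrarily

   # The UTF-16 and UTF-32 encoders reject surrogate code points by default.
   try:
       lone.encode("utf-16-le")
   except UnicodeEncodeError:
       encode_rejected = True
   else:
       encode_rejected = False

   # The 'surrogatepass' error handler lets the code point through unchanged.
   passed = lone.encode("utf-16-le", "surrogatepass")

   # The UTF-32 decoders likewise reject bytes that spell out a surrogate.
   try:
       b"\x00\xd8\x00\x00".decode("utf-32-le")
   except UnicodeDecodeError:
       decode_rejected = True
   else:
       decode_rejected = False

   # An alias resolves to the same codec as the canonical name.
   alias_name = codecs.lookup("cp65001").name
   canonical_name = codecs.lookup("utf_8").name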


Python Specific Encodings
-------------------------

A number of predefined codecs are specific to Python, so their codec names have
no meaning outside Python. These are listed in the tables below based on the
expected input and output types (note that while text encodings are the most
common use case for codecs, the underlying codec infrastructure supports
arbitrary data transforms rather than just text encodings). For asymmetric
codecs, the stated meaning describes the encoding direction.

Text Encodings
^^^^^^^^^^^^^^

The following codecs provide :class:`str` to :class:`bytes` encoding and
:term:`bytes-like object` to :class:`str` decoding, similar to the Unicode text
encodings.

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Meaning                   |
+====================+=========+===========================+
| idna               |         | Implement :rfc:`3490`,    |
|                    |         | see also                  |
|                    |         | :mod:`encodings.idna`.    |
|                    |         | Only ``errors='strict'``  |
|                    |         | is supported.             |
+--------------------+---------+---------------------------+
| mbcs               | ansi,   | Windows only: Encode the  |
|                    | dbcs    | operand according to the  |
|                    |         | ANSI codepage (CP_ACP).   |
+--------------------+---------+---------------------------+
| oem                |         | Windows only: Encode the  |
|                    |         | operand according to the  |
|                    |         | OEM codepage (CP_OEMCP).  |
|                    |         |                           |
|                    |         | .. versionadded:: 3.6     |
+--------------------+---------+---------------------------+
| palmos             |         | Encoding of PalmOS 3.5.   |
+--------------------+---------+---------------------------+
| punycode           |         | Implement :rfc:`3492`.    |
|                    |         | Stateful codecs are not   |
|                    |         | supported.                |
+--------------------+---------+---------------------------+
| raw_unicode_escape |         | Latin-1 encoding with     |
|                    |         | ``\uXXXX`` and            |
|                    |         | ``\UXXXXXXXX`` for other  |
|                    |         | code points. Existing     |
|                    |         | backslashes are not       |
|                    |         | escaped in any way.       |
|                    |         | It is used in the Python  |
|                    |         | pickle protocol.          |
+--------------------+---------+---------------------------+
| undefined          |         | Raise an exception for    |
|                    |         | all conversions, even     |
|                    |         | empty strings. The error  |
|                    |         | handler is ignored.       |
+--------------------+---------+---------------------------+
| unicode_escape     |         | Encoding suitable as the  |
|                    |         | contents of a Unicode     |
|                    |         | literal in ASCII-encoded  |
|                    |         | Python source code,       |
|                    |         | except that quotes are    |
|                    |         | not escaped. Decode       |
|                    |         | from Latin-1 source code. |
|                    |         | Beware that Python source |
|                    |         | code actually uses UTF-8  |
|                    |         | by default.               |
+--------------------+---------+---------------------------+

.. versionchanged:: 3.8
   The ``unicode_internal`` codec was removed.
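
As a rough illustration of a few rows of the table above, the following sketch
encodes some sample strings. The strings themselves are arbitrary choices made
for this example:

.. code-block:: python

   text = "café €"  # one Latin-1 character and one code point beyond Latin-1

   # unicode_escape: the result is pure ASCII, with escapes for everything else.
   escaped = text.encode("unicode_escape")      # b'caf\\xe9 \\u20ac'

   # raw_unicode_escape: Latin-1 characters are kept as single bytes and only
   # code points outside Latin-1 are written as \uXXXX.
   raw = text.encode("raw_unicode_escape")      # b'caf\xe9 \\u20ac'

   # punycode: the RFC 3492 encoding of a single label.
   puny = "bücher".encode("punycode")           # b'bcher-kva'

   # undefined: every conversion fails, whatever error handler is requested.
   try:
       "abc".encode("undefined", "ignore")
   except UnicodeError:
       undefined_raised = True
   else:
       undefined_raised = False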


.. _binary-transforms:

Binary Transforms
^^^^^^^^^^^^^^^^^

The following codecs provide binary transforms: :term:`bytes-like object`
to :class:`bytes` mappings. They are not supported by :meth:`bytes.decode`
(which only produces :class:`str` output).


.. tabularcolumns:: |l|L|L|L|

+----------------------+------------------+------------------------------+------------------------------+
| Codec                | Aliases          | Meaning                      | Encoder / decoder            |
+======================+==================+==============================+==============================+
| base64_codec [#b64]_ | base64, base_64  | Convert the operand to       | :meth:`base64.encodebytes` / |
|                      |                  | multiline MIME base64 (the   | :meth:`base64.decodebytes`   |
|                      |                  | result always includes a     |                              |
|                      |                  | trailing ``'\n'``).          |                              |
|                      |                  |                              |                              |
|                      |                  | .. versionchanged:: 3.4      |                              |
|                      |                  |    accepts any               |                              |
|                      |                  |    :term:`bytes-like object` |                              |
|                      |                  |    as input for encoding and |                              |
|                      |                  |    decoding                  |                              |
+----------------------+------------------+------------------------------+------------------------------+
| bz2_codec            | bz2              | Compress the operand using   | :meth:`bz2.compress` /       |
|                      |                  | bz2.                         | :meth:`bz2.decompress`       |
+----------------------+------------------+------------------------------+------------------------------+
| hex_codec            | hex              | Convert the operand to       | :meth:`binascii.b2a_hex` /   |
|                      |                  | hexadecimal                  | :meth:`binascii.a2b_hex`     |
|                      |                  | representation, with two     |                              |
|                      |                  | digits per byte.             |                              |
+----------------------+------------------+------------------------------+------------------------------+
| quopri_codec         | quopri,          | Convert the operand to MIME  | :meth:`quopri.encode` with   |
|                      | quotedprintable, | quoted printable.            | ``quotetabs=True`` /         |
|                      | quoted_printable |                              | :meth:`quopri.decode`        |
+----------------------+------------------+------------------------------+------------------------------+
| uu_codec             | uu               | Convert the operand using    | :meth:`uu.encode` /          |
|                      |                  | uuencode.                    | :meth:`uu.decode`            |
+----------------------+------------------+------------------------------+------------------------------+
| zlib_codec           | zip, zlib        | Compress the operand using   | :meth:`zlib.compress` /      |
|                      |                  | zlib.                        | :meth:`zlib.decompress`      |
+----------------------+------------------+------------------------------+------------------------------+

.. [#b64] In addition to :term:`bytes-like objects <bytes-like object>`,
   ``'base64_codec'`` also accepts ASCII-only instances of :class:`str` for
   decoding.

.. versionadded:: 3.2
   Restoration of the binary transforms.

.. versionchanged:: 3.4
   Restoration of the aliases for the binary transforms.
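
Because these codecs map bytes to bytes, they are reached through
:func:`codecs.encode` and :func:`codecs.decode` rather than through the
:class:`bytes` methods. A minimal sketch, using an arbitrary sample payload:

.. code-block:: python

   import codecs
   import zlib

   data = b"hello"  # arbitrary sample payload

   hexed = codecs.encode(data, "hex")       # b'68656c6c6f'
   b64 = codecs.encode(data, "base64")      # b'aGVsbG8=\n'

   # Compress and restore a longer payload with the zlib transform.
   packed = codecs.encode(data * 100, "zlib")
   restored = codecs.decode(packed, "zlib")

   # The output of the codec can also be read by the zlib module directly.
   same_as_module = zlib.decompress(packed) == data * 100

   # bytes.decode() only accepts text encodings, so a binary transform is
   # rejected with LookupError.
   try:
       hexed.decode("hex")
   except LookupError:
       method_rejected = True
   else:
       method_rejected = False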


.. _text-transforms:

Text Transforms
^^^^^^^^^^^^^^^

The following codec provides a text transform: a :class:`str` to :class:`str`
mapping. It is not supported by :meth:`str.encode` (which only produces
:class:`bytes` output).

.. tabularcolumns:: |l|l|L|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Meaning                   |
+====================+=========+===========================+
| rot_13             | rot13   | Return the Caesar-cypher  |
|                    |         | encryption of the         |
|                    |         | operand.                  |
+--------------------+---------+---------------------------+

.. versionadded:: 3.2
   Restoration of the ``rot_13`` text transform.

.. versionchanged:: 3.4
   Restoration of the ``rot13`` alias.
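
The text transform is likewise used through :func:`codecs.encode` and
:func:`codecs.decode`. A small sketch with an arbitrary sample string:

.. code-block:: python

   import codecs

   message = "Hello, World!"  # arbitrary sample text

   secret = codecs.encode(message, "rot_13")    # 'Uryyb, Jbeyq!'
   revealed = codecs.decode(secret, "rot13")    # the alias names the same codec

   # str.encode() only accepts text encodings that produce bytes, so the
   # str-to-str transform is rejected with LookupError.
   try:
       message.encode("rot13")
   except LookupError:
       method_rejected = True
   else:
       method_rejected = False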


:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.

If you need the IDNA 2008 standard from :rfc:`5891` and :rfc:`5895`, use the
third-party `idna module <https://pypi.org/project/idna/>`_.

These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application and, if possible, is
invisible to the user: the application should transparently convert Unicode
domain labels to IDNA on the wire, and convert ACE labels back to Unicode
before presenting them to the user.

Python supports this conversion in several ways. The ``idna`` codec performs
conversion between Unicode and ACE: when encoding, it separates the input string
into labels based on the separator characters defined in
:rfc:`section 3.1 of RFC 3490 <3490#section-3.1>` and converts each label to ACE
as required; when decoding, it separates the input byte string into labels based
on the ``.`` separator and converts any ACE labels found into Unicode.
Furthermore, the :mod:`socket` module
transparently converts Unicode host names to ACE, so that applications need not
be concerned about converting host names themselves when they pass them to the
socket module. On top of that, modules that have host names as function
parameters, such as :mod:`http.client` and :mod:`ftplib`, accept Unicode host
names (:mod:`http.client` then also transparently sends an IDNA hostname in the
:mailheader:`Host` field if it sends that field at all).

When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: applications wishing to present
such host names to the user should decode them to Unicode.
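
The conversion in both directions can be sketched with the ``idna`` codec.
The first host name is the example used earlier in this section; the second
host name and the variable names are made up for illustration:

.. code-block:: python

   name = "www.Alliancefrançaise.nu"

   # Unicode to ACE: only labels that need it are converted.
   ace = name.encode("idna")        # b'www.xn--alliancefranaise-npb.nu'

   # ACE back to Unicode, for example before showing a name to the user.
   # The original capitalisation is not restored, because nameprep folds case.
   shown = ace.decode("idna")       # 'www.alliancefrançaise.nu'

   # When encoding, the other separators from RFC 3490 section 3.1 are
   # recognised too, such as the ideographic full stop U+3002.
   mixed = "bücher\u3002example".encode("idna")   # b'xn--bcher-kva.example'

   # A plain ASCII host name passes through unchanged.
   plain = "python.org".encode("idna")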

The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters. The nameprep
functions can be used directly if desired.


.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.
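
A minimal sketch of calling the three functions directly on a single label.
The sample label is an arbitrary choice; note that the functions work on one
label at a time, not on a full dotted host name:

.. code-block:: python

   from encodings import idna

   label = "Bücher"  # a single label, chosen arbitrarily

   # nameprep normalizes the label, which includes case folding.
   prepped = idna.nameprep(label)        # 'bücher'

   # ToASCII applies nameprep and then produces the ACE form as bytes.
   ace_label = idna.ToASCII(label)       # b'xn--bcher-kva'

   # ToUnicode reverses the conversion and returns a str.
   restored = idna.ToUnicode(ace_label)  # 'bücher'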


:mod:`encodings.mbcs` --- Windows ANSI codepage
-----------------------------------------------

.. module:: encodings.mbcs
   :synopsis: Windows ANSI codepage

This module implements the ANSI codepage (CP_ACP).

.. availability:: Windows.

.. versionchanged:: 3.3
   Support any error handler.

.. versionchanged:: 3.2
   Before 3.2, the *errors* argument was ignored; ``'replace'`` was always used
   to encode, and ``'ignore'`` to decode.


:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald

This module implements a variant of the UTF-8 codec. On encoding, a UTF-8 encoded
BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this
is only done once (on the first write to the byte stream). On decoding, an
optional UTF-8 encoded BOM at the start of the data will be skipped.
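
The behaviour described above can be sketched as follows. The sample strings
are arbitrary, and the incremental encoder stands in here for the stateful
encoder mentioned in the text:

.. code-block:: python

   import codecs

   # Encoding prepends the UTF-8 encoded BOM.
   encoded = "abc".encode("utf-8-sig")          # b'\xef\xbb\xbfabc'

   # Decoding skips the BOM if it is present and tolerates its absence.
   with_bom = encoded.decode("utf-8-sig")       # 'abc'
   without_bom = b"abc".decode("utf-8-sig")     # 'abc'

   # The plain UTF-8 codec keeps the BOM as a U+FEFF character instead.
   plain = encoded.decode("utf-8")              # '\ufeffabc'

   # A stateful encoder writes the BOM only on the first call.
   encoder = codecs.getincrementalencoder("utf-8-sig")()
   first = encoder.encode("a")                  # BOM followed by b'a'
   second = encoder.encode("b")                 # b'b'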