:mod:`email.charset`: Representing character sets
-------------------------------------------------

.. module:: email.charset
   :synopsis: Character Sets

**Source code:** :source:`Lib/email/charset.py`

--------------

This module is part of the legacy (``Compat32``) email API. In the new
API only the aliases table is used.

The remaining text in this section is the original documentation of the module.

This module provides a class :class:`Charset` for representing character sets
and character set conversions in email messages, as well as a character set
registry and several convenience methods for manipulating this registry.
Instances of :class:`Charset` are used in several other modules within the
:mod:`email` package.

Import this class from the :mod:`email.charset` module.


.. class:: Charset(input_charset=DEFAULT_CHARSET)

   Map character sets to their email properties.

   This class provides information about the requirements imposed on email for a
   specific character set. It also provides convenience routines for converting
   between character sets, given the availability of the applicable codecs.
   Given a character set, it will do its best to provide information on how to
   use that character set in an email message in an RFC-compliant way.

   Certain character sets must be encoded with quoted-printable or base64 when used
   in email headers or bodies. Certain character sets must be converted outright,
   and are not allowed in email.

   Optional *input_charset* is as described below; it is always coerced to lower
   case. After being alias normalized it is also used as a lookup into the
   registry of character sets to find out the header encoding, body encoding, and
   output conversion codec to be used for the character set. For example, if
   *input_charset* is ``iso-8859-1``, then headers and bodies will be encoded using
   quoted-printable and no output conversion codec is necessary. If
   *input_charset* is ``euc-jp``, then headers will be encoded with base64, bodies
   will not be encoded, but output text will be converted from the ``euc-jp``
   character set to the ``iso-2022-jp`` character set.

   :class:`Charset` instances have the following data attributes:

   .. attribute:: input_charset

      The initial character set specified. Common aliases are converted to
      their *official* email names (e.g. ``latin_1`` is converted to
      ``iso-8859-1``). Defaults to 7-bit ``us-ascii``.


   .. attribute:: header_encoding

      If the character set must be encoded before it can be used in an email
      header, this attribute will be set to ``Charset.QP`` (for
      quoted-printable), ``Charset.BASE64`` (for base64 encoding), or
      ``Charset.SHORTEST`` for the shortest of QP or BASE64 encoding. Otherwise,
      it will be ``None``.


   .. attribute:: body_encoding

      Same as *header_encoding*, but describes the encoding for the mail
      message's body, which indeed may be different than the header encoding.
      ``Charset.SHORTEST`` is not allowed for *body_encoding*.


   .. attribute:: output_charset

      Some character sets must be converted before they can be used in email
      headers or bodies. If the *input_charset* is one of them, this attribute
      will contain the name of the character set output will be converted to.
      Otherwise, it will be ``None``.


   .. attribute:: input_codec

      The name of the Python codec used to convert the *input_charset* to
      Unicode. If no conversion codec is necessary, this attribute will be
      ``None``.


   .. attribute:: output_codec

      The name of the Python codec used to convert Unicode to the
      *output_charset*.
      If no conversion codec is necessary, this attribute
      will have the same value as the *input_codec*.


   :class:`Charset` instances also have the following methods:

   .. method:: get_body_encoding()

      Return the content transfer encoding used for body encoding.

      This is either the string ``quoted-printable`` or ``base64`` depending on
      the encoding used, or it is a function, in which case you should call the
      function with a single argument, the Message object being encoded. The
      function should then set the :mailheader:`Content-Transfer-Encoding`
      header itself to whatever is appropriate.

      Returns the string ``quoted-printable`` if *body_encoding* is ``QP``,
      returns the string ``base64`` if *body_encoding* is ``BASE64``, and
      returns the string ``7bit`` otherwise.


   .. method:: get_output_charset()

      Return the output character set.

      This is the *output_charset* attribute if that is not ``None``, otherwise
      it is *input_charset*.


   .. method:: header_encode(string)

      Header-encode the string *string*.

      The type of encoding (base64 or quoted-printable) will be based on the
      *header_encoding* attribute.


   .. method:: header_encode_lines(string, maxlengths)

      Header-encode a *string* by converting it first to bytes.

      This is similar to :meth:`header_encode` except that the string is fit
      into maximum line lengths as given by the argument *maxlengths*, which
      must be an iterator: each element returned from this iterator will provide
      the next maximum line length.


   .. method:: body_encode(string)

      Body-encode the string *string*.

      The type of encoding (base64 or quoted-printable) will be based on the
      *body_encoding* attribute.

   The :class:`Charset` class also provides a number of methods to support
   standard operations and built-in functions.


   .. method:: __str__()

      Returns *input_charset* as a string coerced to lower
      case. :meth:`__repr__` is an alias for :meth:`__str__`.


   .. method:: __eq__(other)

      This method allows you to compare two :class:`Charset` instances for
      equality.
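
      As a minimal sketch of how alias normalization and comparison interact
      (the variable names are chosen here for illustration only), two
      instances created from different spellings of the same character set
      compare equal:

      .. code-block:: python

         from email.charset import Charset

         # "latin_1" is a common alias; it is normalized to the official name.
         latin_alias = Charset("latin_1")
         # The input is coerced to lower case before the registry lookup.
         latin_official = Charset("ISO-8859-1")

         # str() gives the lower-cased input_charset.
         normalized_name = str(latin_alias)

         # Both spellings describe the same character set.
         same_charset = latin_alias == latin_official
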


   .. method:: __ne__(other)

      This method allows you to compare two :class:`Charset` instances for
      inequality.

The :mod:`email.charset` module also provides the following functions for adding
new entries to the global character set, alias, and codec registries:


.. function:: add_charset(charset, header_enc=None, body_enc=None, output_charset=None)

   Add character properties to the global registry.

   *charset* is the input character set, and must be the canonical name of a
   character set.

   Optional *header_enc* and *body_enc* are each either ``Charset.QP`` for
   quoted-printable, ``Charset.BASE64`` for base64 encoding,
   ``Charset.SHORTEST`` for the shortest of quoted-printable or base64 encoding,
   or ``None`` for no encoding. ``SHORTEST`` is only valid for
   *header_enc*. The default is ``None`` for no encoding.

   Optional *output_charset* is the character set that the output should be in.
   Conversions will proceed from input charset, to Unicode, to the output charset
   when the method :meth:`Charset.convert` is called. The default is to output in
   the same character set as the input.
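
   As a minimal sketch of these parameters, the following registers a
   hypothetical character set name, ``x-demo-charset`` (an assumption made
   purely for illustration, not a real character set), and reads back the
   properties that a new :class:`Charset` instance picks up. The encoding
   constants are taken from the :mod:`email.charset` module namespace here.

   .. code-block:: python

      import email.charset

      # "x-demo-charset" is a hypothetical name used only for illustration.
      email.charset.add_charset(
          "x-demo-charset",
          header_enc=email.charset.QP,      # quoted-printable headers
          body_enc=email.charset.BASE64,    # base64 bodies
          output_charset="utf-8",           # convert output to UTF-8
      )

      demo = email.charset.Charset("x-demo-charset")
      header_enc = demo.header_encoding
      body_cte = demo.get_body_encoding()
      output_name = demo.get_output_charset()

   The registration changes module-level state, so it stays in effect for
   the rest of the process.
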

   Both *charset* and *output_charset* must have Unicode codec entries in the
   module's character set-to-codec mapping; use :func:`add_codec` to add codecs the
   module does not know about. See the :mod:`codecs` module's documentation for
   more information.

   The global character set registry is kept in the module global dictionary
   ``CHARSETS``.


.. function:: add_alias(alias, canonical)

   Add a character set alias. *alias* is the alias name, e.g. ``latin-1``.
   *canonical* is the character set's canonical name, e.g. ``iso-8859-1``.

   The global charset alias registry is kept in the module global dictionary
   ``ALIASES``.


.. function:: add_codec(charset, codecname)

   Add a codec that maps characters in the given character set to and from Unicode.

   *charset* is the canonical name of a character set. *codecname* is the name of a
   Python codec, as appropriate for the second argument to the :class:`str`'s
   :meth:`~str.encode` method.
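
The alias and codec registries can be exercised in the same way. The
following is a sketch only: ``x-latin-demo`` and ``x-demo-codec-charset`` are
hypothetical names invented for this example, and mapping the latter to the
``utf-8`` codec is an arbitrary choice.

.. code-block:: python

   import email.charset

   # Register a hypothetical alias for an existing canonical name.
   email.charset.add_alias("x-latin-demo", "iso-8859-1")
   aliased = email.charset.Charset("X-Latin-Demo")
   canonical_name = aliased.input_charset

   # Map a hypothetical character set name to an existing Python codec.
   email.charset.add_codec("x-demo-codec-charset", "utf-8")
   codec_name = email.charset.Charset("x-demo-codec-charset").input_codec

Both calls modify module-level dictionaries, so the new entries remain visible
to every later :class:`Charset` created in the same process.
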