:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.

**Source code:** :source:`Lib/html/parser.py`

.. index::
   single: HTML
   single: XHTML

--------------

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

.. class:: HTMLParser(*, convert_charrefs=True)

   Create a parser instance able to parse invalid markup.

   If *convert_charrefs* is ``True`` (the default), all character
   references (except the ones in ``script``/``style`` elements) are
   automatically converted to the corresponding Unicode characters.

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered.  The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   This parser does not check that end tags match start tags or call the end-tag
   handler for elements which are closed implicitly by closing an outer element.

   .. versionchanged:: 3.4
      *convert_charrefs* keyword argument added.

   .. versionchanged:: 3.5
      The default value for argument *convert_charrefs* is now ``True``.

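The effect of *convert_charrefs* can be illustrated with a small sketch.  The
``EventLogger`` class and the sample markup are assumptions made for this
illustration only and are not part of the module; the subclass simply records
which handler methods get called.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record text and character-reference events."""

    def __init__(self, *, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


markup = "<p>&lt;b&gt; &amp; &#169;</p>"

converting = EventLogger()  # convert_charrefs=True is the default
converting.feed(markup)
converting.close()

raw = EventLogger(convert_charrefs=False)
raw.feed(markup)
raw.close()

# With conversion on, references arrive already decoded inside the text.
decoded_text = "".join(value for kind, value in converting.events if kind == "data")
print(decoded_text)

# With conversion off, each reference is reported through its own handler.
reference_events = [event for event in raw.events if event[0] != "data"]
print(reference_events)
```

With the default setting only ``handle_data`` sees the references, already
converted; with ``convert_charrefs=False`` the subclass has to decode them itself.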

Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out start tags, end tags, and data
as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Encountered a start tag:", tag)

       def handle_endtag(self, tag):
           print("Encountered an end tag :", tag)

       def handle_data(self, data):
           print("Encountered some data  :", data)

   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be:

.. code-block:: none

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.  *data* must be :class:`str`.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.

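The interplay of ``feed()`` and ``close()`` might look like the following
sketch.  The ``TextCollector`` class, its ``finished`` flag and the chunk
boundaries are hypothetical and chosen only to show input that arrives in
pieces; how many times ``handle_data`` is called for such input is not
something the sketch relies on.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gather text and note when the input has ended."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.finished = False

    def handle_data(self, data):
        self.parts.append(data)

    def close(self):
        # Let the base class process whatever is still buffered first.
        super().close()
        self.finished = True


collector = TextCollector()
# The markup arrives in arbitrary pieces, for example from a network socket.
for chunk in ["<p>Hello, ", "wor", "ld</p", "> and more"]:
    collector.feed(chunk)
collector.close()

text = "".join(collector.parts)
print(text)
```

Joining the collected pieces after ``close()`` gives the complete text
regardless of where the chunk boundaries fell.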

.. method:: HTMLParser.reset()

   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
   instantiation time.


.. method:: HTMLParser.getpos()

   Return the current line number and offset as a ``(lineno, offset)`` tuple.

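As a sketch of how ``getpos()`` and ``reset()`` might be used together, the
hypothetical ``TagLocator`` class below records where each start tag begins.
The class name and the sample input are assumptions for illustration; in
CPython's implementation line numbers start at 1 and offsets at 0.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Hypothetical helper: remember where each start tag was found."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) tuple.
        self.positions.append((tag, self.getpos()))


locator = TagLocator()
locator.feed("<html>\n  <body>\n<p>hi</p>")
locator.close()
first_run = list(locator.positions)
print(first_run)

# reset() discards buffered input and restarts position tracking, so the
# same instance can be reused for a second, unrelated document.
locator.positions.clear()
locator.reset()
locator.feed("<div>")
locator.close()
second_run = list(locator.positions)
print(second_run)
```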

.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).

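The difference between the normalized arguments and the raw tag text can be
seen in a short sketch.  ``RawTagRecorder`` and the sample tag are
hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class RawTagRecorder(HTMLParser):
    """Hypothetical helper: keep the parsed and the raw form of each start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() gives the tag exactly as it appeared in the input.
        self.seen.append((tag, attrs, self.get_starttag_text()))


recorder = RawTagRecorder()
recorder.feed('<A  HREF="index.html"   Class=Link>home</A>')
recorder.close()

tag, attrs, raw_text = recorder.seen[0]
print(tag, attrs)   # lower-cased names, quotes removed
print(raw_text)     # original case, spacing and quoting
```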

The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass.  The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start tag of an element (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  Each *name* is converted to lower case;
   quotes around the *value* are removed, and character and entity references
   in it are replaced.

   For instance, for the tag ``<A HREF="https://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'https://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.

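A common use of ``handle_starttag()`` is collecting attribute values.  The
sketch below uses a hypothetical ``LinkCollector`` class and made-up markup; it
also relies on the observed CPython behavior that an attribute written without
a value is reported with ``None`` as its value, which the description above
does not spell out.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gather link targets and value-less attributes."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.flags = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href") is not None:
            self.links.append(attributes["href"])
        # Attributes such as CHECKED carry no value and show up as None.
        self.flags.extend(name for name, value in attrs if value is None)


collector = LinkCollector()
collector.feed('<A HREF="search?q=html&amp;page=2">next</A>'
               '<input type=checkbox CHECKED>')
collector.close()

print(collector.links)  # the &amp; in the attribute value has been replaced
print(collector.flags)
```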

.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``).  This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.

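The default delegation and an override can be compared side by side.  Both
classes in this sketch (``DefaultEvents`` and ``EmptyTagAware``) are
hypothetical and exist only to make the two behaviors visible.

```python
from html.parser import HTMLParser


class DefaultEvents(HTMLParser):
    """Hypothetical helper: log start and end tags, nothing else."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class EmptyTagAware(DefaultEvents):
    """Same logger, but XHTML-style empty tags get their own event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


sample = "<p>line one<br/>line two</p>"

default_parser = DefaultEvents()
default_parser.feed(sample)
default_parser.close()

aware_parser = EmptyTagAware()
aware_parser.feed(sample)
aware_parser.close()

print(default_parser.events)  # <br/> appears as a start tag plus an end tag
print(aware_parser.events)    # <br/> appears as a single 'startend' event
```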

.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).

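Because the content of ``script`` and ``style`` elements also reaches
``handle_data()``, a subclass that wants only visible text has to filter it
out itself.  The following is a minimal sketch of one way to do that; the
``VisibleText`` class and its counter are assumptions of this example, not
something the module provides.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Hypothetical helper: collect text outside script and style elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style content and whitespace-only text.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


extractor = VisibleText()
extractor.feed('<p>Hi</p><script>var x = "<b>";</script>'
               '<style>p { color: red }</style><p>there</p>')
extractor.close()
print(extractor.chunks)
```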

.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).  This method is never called if *convert_charrefs* is
   ``True``.


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.  This method
   is never called if *convert_charrefs* is ``True``.

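When *convert_charrefs* is ``False``, the two handlers above can be used to
rebuild the text by hand.  The sketch below is one possible approach under
that assumption; the ``ManualDecoder`` class is hypothetical, and keeping
unknown entity names verbatim (always with a trailing ``;``) is a
simplification chosen for the example.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class ManualDecoder(HTMLParser):
    """Hypothetical helper: decode character references in handler methods."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_entityref(self, name):
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            # Unknown names are kept as written instead of raising KeyError.
            self.pieces.append(f"&{name};")
        else:
            self.pieces.append(chr(codepoint))

    def handle_charref(self, name):
        # name is '62' for &#62; and 'x3E' for &#x3E;.
        if name[0] in "xX":
            self.pieces.append(chr(int(name[1:], 16)))
        else:
            self.pieces.append(chr(int(name)))


decoder = ManualDecoder()
decoder.feed("1 &lt; 2 &#62; 0 &#x26; &bogus; stays")
decoder.close()

decoded = "".join(decoder.pieces)
print(decoded)
```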

.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.

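The two cases described above can be checked with a short sketch; the
``CommentCollector`` class is a hypothetical name used only here.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper: keep the content of every comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data excludes the surrounding <!-- and --> markers.
        self.comments.append(data)


collector = CommentCollector()
collector.feed("<!-- comment -->"
               "<!--[if IE 9]>IE9-specific content<![endif]-->")
collector.close()
print(collector.comments)
```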

.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   This method is called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction. For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.

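Doctype declarations and processing instructions can be observed together in
one sketch.  The ``PrologRecorder`` class and the sample input are
hypothetical; the second processing instruction uses the XHTML-style trailing
``?`` mentioned in the note above.

```python
from html.parser import HTMLParser


class PrologRecorder(HTMLParser):
    """Hypothetical helper: record declarations and processing instructions."""

    def __init__(self):
        super().__init__()
        self.declarations = []
        self.instructions = []

    def handle_decl(self, decl):
        self.declarations.append(decl)

    def handle_pi(self, data):
        self.instructions.append(data)


recorder = PrologRecorder()
recorder.feed("<!DOCTYPE html>"
              "<?proc color='red'>"
              '<?xml-stylesheet href="a.css"?>')
recorder.close()

print(recorder.declarations)
print(recorder.instructions)  # the second entry keeps its trailing '?'
```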

.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup.  Overriding this method in a derived class is
   sometimes useful.  The base class implementation does nothing.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print("     attr:", attr)

       def handle_endtag(self, tag):
           print("End tag  :", tag)

       def handle_data(self, data):
           print("Data     :", data)

       def handle_comment(self, data):
           print("Comment  :", data)

       def handle_entityref(self, name):
           c = chr(name2codepoint[name])
           print("Named ent:", c)

       def handle_charref(self, name):
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent  :", c)

       def handle_decl(self, data):
           print("Decl     :", data)

   parser = MyHTMLParser()

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style

   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("<strong>hello!</strong>");
   End tag  : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  :  a comment
   Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``).
:meth:`~HTMLParser.handle_entityref` and :meth:`~HTMLParser.handle_charref`
are never called if *convert_charrefs* is ``True``, so the parser is
re-created here with ``convert_charrefs=False``::

   >>> parser = MyHTMLParser(convert_charrefs=False)
   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once when
*convert_charrefs* is ``False``, as it is for the parser created above::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
        attr: ('class', 'link')
        attr: ('href', '#main')
   Data     : tag soup
   End tag  : p
   End tag  : a