:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.

**Source code:** :source:`Lib/html/parser.py`

.. index::
   single: HTML
   single: XHTML

--------------

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

.. class:: HTMLParser(*, convert_charrefs=True)

   Create a parser instance able to parse invalid markup.

   If *convert_charrefs* is ``True`` (the default), all character
   references (except the ones in ``script``/``style`` elements) are
   automatically converted to the corresponding Unicode characters.

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered.  The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   This parser does not check that end tags match start tags or call the end-tag
   handler for elements which are closed implicitly by closing an outer element.

   .. versionchanged:: 3.4
      *convert_charrefs* keyword argument added.

   .. versionchanged:: 3.5
      The default value for argument *convert_charrefs* is now ``True``.


Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out start tags, end tags, and data
as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Encountered a start tag:", tag)

       def handle_endtag(self, tag):
           print("Encountered an end tag :", tag)

       def handle_data(self, data):
           print("Encountered some data  :", data)

   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be:

.. code-block:: none

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.  *data* must be :class:`str`.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.
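
The buffering behavior of ``feed()`` and ``close()`` described above can be
sketched as follows.  This is only an illustration: ``TextCollector`` and its
``events`` and ``text()`` members are hypothetical names, not part of the
module.  Because text may be reported in more than one piece, the sketch joins
the pieces rather than relying on how they are split.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that records tags and text as they are reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def text(self):
        # Text may arrive in several pieces, so join them.
        return "".join(value for kind, value in self.events if kind == "data")


collector = TextCollector()
collector.feed("<p>Hello, wor")   # "<p>" is complete and is handled now
collector.feed("ld</")            # "</" is incomplete and stays buffered
collector.feed("p>")              # completes the end tag
collector.close()                 # process anything still buffered

print(collector.events[0])        # ('start', 'p')
print(collector.text())           # Hello, world
print(collector.events[-1])       # ('end', 'p')
```

Calling ``close()`` once the input is exhausted keeps trailing data from being
left in the buffer.
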


.. method:: HTMLParser.reset()

   Reset the instance, discarding all unprocessed data.  This is called
   implicitly at instantiation time.


.. method:: HTMLParser.getpos()

   Return the current line number and offset as a ``(lineno, offset)`` tuple.


.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).


The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass.  The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start tag of an element (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  The *name* is converted to lower case,
   quotes around the *value* are removed, and character and entity references
   are replaced.

   For instance, for the tag ``<A HREF="https://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'https://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.


.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``).  This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.


.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).  This method is never called if *convert_charrefs* is
   ``True``.


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.  This method
   is never called if *convert_charrefs* is ``True``.


.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   This method is called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction.  For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.


.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup.  Overriding it in a derived class is sometimes
   useful.  The base class implementation does nothing.
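
As a rough sketch of how the handler methods above fit together, the following
subclass collects link targets and notes XHTML-style empty tags.  The class
name ``LinkExtractor`` and its ``links`` and ``self_closing`` attributes are
hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Hypothetical subclass that collects href values and empty tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.self_closing = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lower case and
        # character references in values are already replaced.
        for name, value in attrs:
            if tag == "a" and name == "href":
                self.links.append(value)

    def handle_startendtag(self, tag, attrs):
        # Record the XHTML-style empty tag, then keep the default behavior
        # of calling handle_starttag() and handle_endtag().
        self.self_closing.append(tag)
        super().handle_startendtag(tag, attrs)


extractor = LinkExtractor()
extractor.feed('<A HREF="https://www.cwi.nl/?a=1&amp;b=2">CWI</A><br />')
extractor.close()

print(extractor.links)         # ['https://www.cwi.nl/?a=1&b=2']
print(extractor.self_closing)  # ['br']
```

Delegating to the base class in ``handle_startendtag()`` means the start-tag
and end-tag handlers still see the element, so existing logic keeps working.
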


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print("     attr:", attr)

       def handle_endtag(self, tag):
           print("End tag  :", tag)

       def handle_data(self, data):
           print("Data     :", data)

       def handle_comment(self, data):
           print("Comment  :", data)

       def handle_entityref(self, name):
           c = chr(name2codepoint[name])
           print("Named ent:", c)

       def handle_charref(self, name):
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent  :", c)

       def handle_decl(self, data):
           print("Decl     :", data)

   parser = MyHTMLParser()


Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style

   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("<strong>hello!</strong>");
   End tag  : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  :  a comment
   Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``).
:meth:`~HTMLParser.handle_entityref` and :meth:`~HTMLParser.handle_charref`
are never called if *convert_charrefs* is ``True``, so the parser is
re-created with *convert_charrefs* set to ``False`` and used from here on::

   >>> parser = MyHTMLParser(convert_charrefs=False)
   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once
(unless *convert_charrefs* is set to ``True``)::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
        attr: ('class', 'link')
        attr: ('href', '#main')
   Data     : tag soup
   End tag  : p
   End tag  : a
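
Because the parser does not check that end tags match start tags, a subclass
has to track nesting itself if it needs to.  The following is a minimal sketch
under simplifying assumptions: ``NestingChecker``, ``open_tags`` and
``mismatches`` are hypothetical names, and elements that never take an end tag
(such as ``br``) are not treated specially.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Hypothetical subclass that reports end tags closing the wrong element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # stack of currently open tag names
        self.mismatches = []   # (expected, found) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Compare the end tag with the most recently opened element.
        expected = self.open_tags.pop() if self.open_tags else None
        if expected != tag:
            self.mismatches.append((expected, tag))


checker = NestingChecker()
checker.feed('<p><a class=link href=#main>tag soup</p ></a>')
checker.close()

print(checker.mismatches)  # [('a', 'p'), ('p', 'a')]
```

For the tag soup input used above, both end tags are reported because each one
closes an element other than the most recently opened one.
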