:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.

**Source code:** :source:`Lib/html/parser.py`

.. index::
   single: HTML
   single: XHTML

--------------

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

.. class:: HTMLParser(*, convert_charrefs=True)

   Create a parser instance able to parse invalid markup.

   If *convert_charrefs* is ``True`` (the default), all character
   references (except the ones in ``script``/``style`` elements) are
   automatically converted to the corresponding Unicode characters.

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered.  The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   This parser does not check that end tags match start tags or call the end-tag
   handler for elements which are closed implicitly by closing an outer element.

   .. versionchanged:: 3.4
      *convert_charrefs* keyword argument added.

   .. versionchanged:: 3.5
      The default value for argument *convert_charrefs* is now ``True``.
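
The effect of *convert_charrefs* can be seen by recording which handlers fire
for the same input under both settings. The following is a minimal sketch, not
part of the module: the ``EventRecorder`` class and the ``collect`` helper are
hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper that records text-related handler calls."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Only reached when convert_charrefs is False.
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # Only reached when convert_charrefs is False.
        self.events.append(("charref", name))


def collect(markup, convert_charrefs):
    """Parse *markup* and return the recorded (handler, argument) pairs."""
    parser = EventRecorder(convert_charrefs=convert_charrefs)
    parser.feed(markup)
    parser.close()  # flush any text that is still buffered
    return parser.events


markup = "<p>1 &lt; 2 &#38; 3</p>"
print(collect(markup, convert_charrefs=True))
print(collect(markup, convert_charrefs=False))
```

With the default setting the text should arrive through ``handle_data`` with
the references already decoded; with ``convert_charrefs=False`` the references
are reported separately through ``handle_entityref`` and ``handle_charref``.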


Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out start tags, end tags, and data
as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Encountered a start tag:", tag)

       def handle_endtag(self, tag):
           print("Encountered an end tag :", tag)

       def handle_data(self, data):
           print("Encountered some data  :", data)

   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be:

.. code-block:: none

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.  *data* must be :class:`str`.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.


.. method:: HTMLParser.reset()

   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
   instantiation time.


.. method:: HTMLParser.getpos()

   Return current line number and offset.


.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).


The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass.  The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start tag of an element (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case.  The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  The *name* is converted to lower case,
   quotes around the *value* are removed, and character and entity references
   in the *value* are replaced.

   For instance, for the tag ``<A HREF="https://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'https://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.


.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``).
   This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.


.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).  This method is never called if *convert_charrefs* is
   ``True``.


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.  This method
   is never called if *convert_charrefs* is ``True``.


.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   Method called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction.  For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.


.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup.  Overriding it in a derived class is sometimes
   useful.  The base class implementation does nothing.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print("     attr:", attr)

       def handle_endtag(self, tag):
           print("End tag  :", tag)

       def handle_data(self, data):
           print("Data     :", data)

       def handle_comment(self, data):
           print("Comment  :", data)

       def handle_entityref(self, name):
           c = chr(name2codepoint[name])
           print("Named ent:", c)

       def handle_charref(self, name):
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent  :", c)

       def handle_decl(self, data):
           print("Decl     :", data)

   parser = MyHTMLParser()

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style

   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("<strong>hello!</strong>");
   End tag  : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  :  a comment
   Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``).  The
reference handlers are only called when *convert_charrefs* is ``False``, so
a parser is created with that setting first::

   >>> parser = MyHTMLParser(convert_charrefs=False)
   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once
if *convert_charrefs* is ``False``, as it is for the parser created in the
previous example::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
        attr: ('class', 'link')
        attr: ('href', '#main')
   Data     : tag soup
   End tag  : p
   End tag  : a
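
Handlers can also collect results instead of printing them. The sketch below
gathers the ``href`` value of every ``<a>`` start tag, together with the
position reported by :meth:`~HTMLParser.getpos` while the handler runs. The
``LinkCollector`` class and the ``find_links`` function are hypothetical names
used only for this illustration, and the sample markup is made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that records (line, offset, href) for <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive already converted to lower case.
        if tag != "a":
            return
        for name, value in attrs:
            # An attribute written without a value is reported as None.
            if name == "href" and value is not None:
                line, offset = self.getpos()
                self.links.append((line, offset, value))


def find_links(markup):
    """Return the links found in *markup* as (line, offset, href) tuples."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


sample = ('<p>See <a href="https://www.python.org/">Python</a>\n'
          'and <A HREF=docs.html>docs</A>.</p>')
for line, offset, href in find_links(sample):
    print(line, offset, href)
```

Storing results on the instance keeps the handler free of side effects such as
printing, which makes the parser easier to reuse and to test.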