html5lib
========

.. image:: https://travis-ci.org/html5lib/html5lib-python.png?branch=master
  :target: https://travis-ci.org/html5lib/html5lib-python

html5lib is a pure-Python library for parsing HTML. It is designed to
conform to the WHATWG HTML specification, as implemented by all major
web browsers.


Usage
-----

Simple usage follows this pattern:

.. code-block:: python

  import html5lib

  # Open in binary mode so that html5lib can detect the encoding itself.
  with open("mydocument.html", "rb") as f:
      document = html5lib.parse(f)

or:

.. code-block:: python

  import html5lib

  # A string can be parsed directly as well.
  document = html5lib.parse("<p>Hello World!")

By default, the ``document`` will be an ``xml.etree`` element instance.
Whenever possible, html5lib chooses the accelerated ``ElementTree``
implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).

Two other tree types are supported: ``xml.dom.minidom`` and
``lxml.etree``. To use an alternative format, specify the name of
a treebuilder:

.. code-block:: python

  import html5lib

  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When using html5lib with ``urllib2`` (Python 2), the charset from the
HTTP response should be passed to html5lib as follows:

.. code-block:: python

  from contextlib import closing
  from urllib2 import urlopen
  import html5lib

  # urllib2 responses are not context managers, hence closing().
  with closing(urlopen("http://example.com/")) as f:
      document = html5lib.parse(f, encoding=f.info().getparam("charset"))

When using html5lib with ``urllib.request`` (Python 3), the charset from
the HTTP response should be passed to html5lib as follows:

.. code-block:: python

  from urllib.request import urlopen
  import html5lib

  with urlopen("http://example.com/") as f:
      document = html5lib.parse(f, encoding=f.info().get_content_charset())

To have more control over the parser, create a parser object explicitly.
For instance, to make the parser raise exceptions on parse errors, use:

.. code-block:: python

  import html5lib

  with open("mydocument.html", "rb") as f:
      parser = html5lib.HTMLParser(strict=True)
      document = parser.parse(f)

When instantiating parser objects explicitly, pass a treebuilder class
as the ``tree`` keyword argument to use an alternative document format:

.. code-block:: python

  import html5lib

  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

More documentation is available at http://html5lib.readthedocs.org/.


Installation
------------

html5lib works on CPython 2.6+, CPython 3.2+ and PyPy. To install it,
use:

.. code-block:: bash

  $ pip install html5lib


Optional Dependencies
---------------------

The following third-party libraries may be used for additional
functionality:

- ``datrie`` can be used to improve parsing performance (though in
  almost all cases the improvement is marginal);

- ``lxml`` is supported as a tree format (for both building and
  walking) under CPython (but *not* PyPy, where it is known to cause
  segfaults);

- ``genshi`` has a treewalker (but not a treebuilder);

- ``charade`` can be used as a fallback when the character encoding
  cannot be determined; ``chardet``, from which it was forked, can also
  be used on Python 2; and

- ``ordereddict`` can be used under Python 2.6
  (``collections.OrderedDict`` is used instead on later versions) to
  serialize attributes in alphabetical order.


Bugs
----

Please report any bugs on the `issue tracker
<https://github.com/html5lib/html5lib-python/issues>`_.


Tests
-----

Unit tests require the ``nose`` library and can be run using the
``nosetests`` command in the root directory; ``ordereddict`` is
required under Python 2.6. All tests should pass.

Test data are kept in a separate `html5lib-tests
<https://github.com/html5lib/html5lib-tests>`_ repository, which is
included as a submodule. In a git checkout, the submodule must
therefore be initialized first::

  $ git submodule init
  $ git submodule update

If you have all compatible Python implementations available on your
system, you can run the tests on all of them using the ``tox`` utility,
which can be found on PyPI.


Questions?
----------

There's a mailing list available for support on Google Groups,
`html5lib-discuss <http://groups.google.com/group/html5lib-discuss>`_,
though you may get a quicker response asking on IRC in `#whatwg on
irc.freenode.net <http://wiki.whatwg.org/wiki/IRC>`_.