Beautiful Soup Documentation
============================

.. image:: 6.1.jpg
   :align: right
   :alt: "The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."

`Beautiful Soup <http://www.crummy.com/software/BeautifulSoup/>`_ is a
Python library for pulling data out of HTML and XML files. It works
with your favorite parser to provide idiomatic ways of navigating,
searching, and modifying the parse tree. It commonly saves programmers
hours or days of work.

These instructions illustrate all major features of Beautiful Soup 4,
with examples. I show you what the library is good for, how it works,
how to use it, how to make it do what you want, and what to do when it
violates your expectations.

The examples in this documentation should work the same way in Python
2.7 and Python 3.2.

You might be looking for the documentation for `Beautiful Soup 3
<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_.
If so, you should know that Beautiful Soup 3 is no longer being
developed, and that Beautiful Soup 4 is recommended for all new
projects. If you want to learn about the differences between Beautiful
Soup 3 and Beautiful Soup 4, see `Porting code to BS4`_.

This documentation has been translated into other languages by its users.

* This document is also available in a Korean translation. (`external link <http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>`_)

Getting help
------------

If you have questions about Beautiful Soup, or run into problems,
`send mail to the discussion group
<https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup>`_. If
your problem involves parsing an HTML document, be sure to mention
:ref:`what the diagnose() function says <diagnose>` about
that document.
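
The ``diagnose()`` function lives in the ``bs4.diagnose`` module. A
minimal sketch of running it (the markup here is just a stand-in for
your own problem document)::

 from bs4.diagnose import diagnose

 # diagnose() runs the markup through every parser Beautiful Soup can
 # find on your system and prints what each one makes of it.
 diagnose("<p>Some <b>bad<p>markup")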

Quick Start
===========

Here's an HTML document I'll be using as an example throughout this
document. It's part of a story from `Alice in Wonderland`::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """

Running the "three sisters" document through Beautiful Soup gives us a
``BeautifulSoup`` object, which represents the document as a nested
data structure::

 from bs4 import BeautifulSoup
 soup = BeautifulSoup(html_doc)

 print(soup.prettify())
 # <html>
 #  <head>
 #   <title>
 #    The Dormouse's story
 #   </title>
 #  </head>
 #  <body>
 #   <p class="title">
 #    <b>
 #     The Dormouse's story
 #    </b>
 #   </p>
 #   <p class="story">
 #    Once upon a time there were three little sisters; and their names were
 #    <a class="sister" href="http://example.com/elsie" id="link1">
 #     Elsie
 #    </a>
 #    ,
 #    <a class="sister" href="http://example.com/lacie" id="link2">
 #     Lacie
 #    </a>
 #    and
 #    <a class="sister" href="http://example.com/tillie" id="link3">
 #     Tillie
 #    </a>
 #    ; and they lived at the bottom of a well.
 #   </p>
 #   <p class="story">
 #    ...
 #   </p>
 #  </body>
 # </html>

Here are some simple ways to navigate that data structure::

 soup.title
 # <title>The Dormouse's story</title>

 soup.title.name
 # u'title'

 soup.title.string
 # u'The Dormouse's story'

 soup.title.parent.name
 # u'head'

 soup.p
 # <p class="title"><b>The Dormouse's story</b></p>

 soup.p['class']
 # [u'title']

 soup.a
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 soup.find_all('a')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.find(id="link3")
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page's <a> tags::

 for link in soup.find_all('a'):
     print(link.get('href'))
 # http://example.com/elsie
 # http://example.com/lacie
 # http://example.com/tillie

Another common task is extracting all the text from a page::

 print(soup.get_text())
 # The Dormouse's story
 #
 # The Dormouse's story
 #
 # Once upon a time there were three little sisters; and their names were
 # Elsie,
 # Lacie and
 # Tillie;
 # and they lived at the bottom of a well.
 #
 # ...

Does this look like what you need? If so, read on.

Installing Beautiful Soup
=========================

If you're using a recent version of Debian or Ubuntu Linux, you can
install Beautiful Soup with the system package manager:

:kbd:`$ apt-get install python-bs4`

Beautiful Soup 4 is published through PyPI, so if you can't install it
with the system packager, you can install it with ``easy_install`` or
``pip``. The package name is ``beautifulsoup4``, and the same package
works on Python 2 and Python 3.

:kbd:`$ easy_install beautifulsoup4`

:kbd:`$ pip install beautifulsoup4`

(The ``BeautifulSoup`` package is probably `not` what you want. That's
the previous major release, `Beautiful Soup 3`_. Lots of software uses
BS3, so it's still available, but if you're writing new code you
should install ``beautifulsoup4``.)

If you don't have ``easy_install`` or ``pip`` installed, you can
`download the Beautiful Soup 4 source tarball
<http://www.crummy.com/software/BeautifulSoup/download/4.x/>`_ and
install it with ``setup.py``.

:kbd:`$ python setup.py install`

If all else fails, the license for Beautiful Soup allows you to
package the entire library with your application. You can download the
tarball, copy its ``bs4`` directory into your application's codebase,
and use Beautiful Soup without installing it at all.

I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it
should work with other recent versions.

Problems after installation
---------------------------

Beautiful Soup is packaged as Python 2 code. When you install it for
use with Python 3, it's automatically converted to Python 3 code. If
you don't install the package, the code won't be converted. There have
also been reports on Windows machines of the wrong version being
installed.

If you get the ``ImportError`` "No module named HTMLParser", your
problem is that you're running the Python 2 version of the code under
Python 3.

If you get the ``ImportError`` "No module named html.parser", your
problem is that you're running the Python 3 version of the code under
Python 2.

In both cases, your best bet is to completely remove the Beautiful
Soup installation from your system (including any directory created
when you unzipped the tarball) and try the installation again.

If you get the ``SyntaxError`` "Invalid syntax" on the line
``ROOT_TAG_NAME = u'[document]'``, you need to convert the Python 2
code to Python 3. You can do this either by installing the package:

:kbd:`$ python3 setup.py install`

or by manually running Python's ``2to3`` conversion script on the
``bs4`` directory:

:kbd:`$ 2to3-3.2 -w bs4`

.. _parser-installation:


Installing a parser
-------------------

Beautiful Soup supports the HTML parser included in Python's standard
library, but it also supports a number of third-party Python parsers.
One is the `lxml parser <http://lxml.de/>`_. Depending on your setup,
you might install lxml with one of these commands:

:kbd:`$ apt-get install python-lxml`

:kbd:`$ easy_install lxml`

:kbd:`$ pip install lxml`

Another alternative is the pure-Python `html5lib parser
<http://code.google.com/p/html5lib/>`_, which parses HTML the way a
web browser does. Depending on your setup, you might install html5lib
with one of these commands:

:kbd:`$ apt-get install python-html5lib`

:kbd:`$ easy_install html5lib`

:kbd:`$ pip install html5lib`

This table summarizes the advantages and disadvantages of each parser library:

+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| Parser               | Typical usage                              | Advantages                     | Disadvantages            |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| Python's html.parser | ``BeautifulSoup(markup, "html.parser")``   | * Batteries included           | * Not very lenient       |
|                      |                                            | * Decent speed                 |   (before Python 2.7.3   |
|                      |                                            | * Lenient (as of Python 2.7.3  |   or 3.2.2)              |
|                      |                                            |   and 3.2.)                    |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| lxml's HTML parser   | ``BeautifulSoup(markup, "lxml")``          | * Very fast                    | * External C dependency  |
|                      |                                            | * Lenient                      |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| lxml's XML parser    | ``BeautifulSoup(markup, ["lxml", "xml"])`` | * Very fast                    | * External C dependency  |
|                      | ``BeautifulSoup(markup, "xml")``           | * The only currently supported |                          |
|                      |                                            |   XML parser                   |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| html5lib             | ``BeautifulSoup(markup, "html5lib")``      | * Extremely lenient            | * Very slow              |
|                      |                                            | * Parses pages the same way a  | * External Python        |
|                      |                                            |   web browser does             |   dependency             |
|                      |                                            | * Creates valid HTML5          |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+

If you can, I recommend you install and use lxml for speed. If you're
using a version of Python 2 earlier than 2.7.3, or a version of Python
3 earlier than 3.2.2, it's `essential` that you install lxml or
html5lib--Python's built-in HTML parser is just not very good in older
versions.

Note that if a document is invalid, different parsers will generate
different Beautiful Soup trees for it. See `Differences
between parsers`_ for details.
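
As a quick illustration, here is an invalid fragment run through the
built-in parser; if lxml or html5lib is installed, passing ``"lxml"``
or ``"html5lib"`` instead of ``"html.parser"`` will generally produce
a different tree (the fragment is just an example)::

 from bs4 import BeautifulSoup

 # A stray </p> with no matching <p>: every parser has to guess what
 # the author meant, and different parsers guess differently.
 soup = BeautifulSoup("<a></p>", "html.parser")
 print(soup)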

Making the soup
===============

To parse a document, pass it into the ``BeautifulSoup``
constructor. You can pass in a string or an open filehandle::

 from bs4 import BeautifulSoup

 soup = BeautifulSoup(open("index.html"))

 soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are
converted to Unicode characters::

 BeautifulSoup("Sacr&eacute; bleu!")
 # <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available
parser. It will use an HTML parser unless you specifically tell it to
use an XML parser. (See `Parsing XML`_.)

Kinds of objects
================

Beautiful Soup transforms a complex HTML document into a complex tree
of Python objects. But you'll only ever have to deal with about four
`kinds` of objects.

.. _Tag:

``Tag``
-------

A ``Tag`` object corresponds to an XML or HTML tag in the original document::

 soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
 tag = soup.b
 type(tag)
 # <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I'll cover most of them
in `Navigating the tree`_ and `Searching the tree`_. For now, the most
important features of a tag are its name and attributes.

Name
^^^^

Every tag has a name, accessible as ``.name``::

 tag.name
 # u'b'

If you change a tag's name, the change will be reflected in any HTML
markup generated by Beautiful Soup::

 tag.name = "blockquote"
 tag
 # <blockquote class="boldest">Extremely bold</blockquote>

Attributes
^^^^^^^^^^

A tag may have any number of attributes. The tag ``<b
class="boldest">`` has an attribute "class" whose value is
"boldest". You can access a tag's attributes by treating the tag like
a dictionary::

 tag['class']
 # u'boldest'

You can access that dictionary directly as ``.attrs``::

 tag.attrs
 # {u'class': u'boldest'}

You can add, remove, and modify a tag's attributes. Again, this is
done by treating the tag as a dictionary::

 tag['class'] = 'verybold'
 tag['id'] = 1
 tag
 # <blockquote class="verybold" id="1">Extremely bold</blockquote>

 del tag['class']
 del tag['id']
 tag
 # <blockquote>Extremely bold</blockquote>

 tag['class']
 # KeyError: 'class'
 print(tag.get('class'))
 # None

.. _multivalue:

Multi-valued attributes
&&&&&&&&&&&&&&&&&&&&&&&

HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is ``class`` (that is, a tag can have more than
one CSS class). Others include ``rel``, ``rev``, ``accept-charset``,
``headers``, and ``accesskey``. Beautiful Soup presents the value(s)
of a multi-valued attribute as a list::

 css_soup = BeautifulSoup('<p class="body strikeout"></p>')
 css_soup.p['class']
 # ["body", "strikeout"]

 css_soup = BeautifulSoup('<p class="body"></p>')
 css_soup.p['class']
 # ["body"]

If an attribute `looks` like it has more than one value, but it's not
a multi-valued attribute as defined by any version of the HTML
standard, Beautiful Soup will leave the attribute alone::

 id_soup = BeautifulSoup('<p id="my id"></p>')
 id_soup.p['id']
 # 'my id'

When you turn a tag back into a string, multiple attribute values are
consolidated::

 rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
 rel_soup.a['rel']
 # ['index']
 rel_soup.a['rel'] = ['index', 'contents']
 print(rel_soup.p)
 # <p>Back to the <a rel="index contents">homepage</a></p>

If you parse a document as XML, there are no multi-valued attributes::

 xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
 xml_soup.p['class']
 # u'body strikeout'


``NavigableString``
-------------------

A string corresponds to a bit of text within a tag. Beautiful Soup
uses the ``NavigableString`` class to contain these bits of text::

 tag.string
 # u'Extremely bold'
 type(tag.string)
 # <class 'bs4.element.NavigableString'>

A ``NavigableString`` is just like a Python Unicode string, except
that it also supports some of the features described in `Navigating
the tree`_ and `Searching the tree`_. You can convert a
``NavigableString`` to a Unicode string with ``unicode()``::

 unicode_string = unicode(tag.string)
 unicode_string
 # u'Extremely bold'
 type(unicode_string)
 # <type 'unicode'>

You can't edit a string in place, but you can replace one string with
another, using :ref:`replace_with`::

 tag.string.replace_with("No longer bold")
 tag
 # <blockquote>No longer bold</blockquote>

``NavigableString`` supports most of the features described in
`Navigating the tree`_ and `Searching the tree`_, but not all of
them. In particular, since a string can't contain anything (the way a
tag may contain a string or another tag), strings don't support the
``.contents`` or ``.string`` attributes, or the ``find()`` method.

If you want to use a ``NavigableString`` outside of Beautiful Soup,
you should call ``unicode()`` on it to turn it into a normal Python
Unicode string. If you don't, your string will carry around a
reference to the entire Beautiful Soup parse tree, even when you're
done using Beautiful Soup. This is a big waste of memory.
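
As a sketch (using ``str()``, which under Python 3 plays the role of
the ``unicode()`` call above)::

 from bs4 import BeautifulSoup

 soup = BeautifulSoup("<b>Extremely bold</b>")
 # Store a plain string rather than the NavigableString, so the saved
 # value doesn't keep the whole parse tree alive.
 plain = str(soup.b.string)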

``BeautifulSoup``
-----------------

The ``BeautifulSoup`` object itself represents the document as a
whole. For most purposes, you can treat it as a :ref:`Tag`
object. This means it supports most of the methods described in
`Navigating the tree`_ and `Searching the tree`_.

Since the ``BeautifulSoup`` object doesn't correspond to an actual
HTML or XML tag, it has no name and no attributes. But sometimes it's
useful to look at its ``.name``, so it's been given the special
``.name`` "[document]"::

 soup.name
 # u'[document]'

Comments and other special strings
----------------------------------

``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
everything you'll see in an HTML or XML file, but there are a few
leftover bits. The only one you'll probably ever need to worry about
is the comment::

 markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
 soup = BeautifulSoup(markup)
 comment = soup.b.string
 type(comment)
 # <class 'bs4.element.Comment'>

The ``Comment`` object is just a special type of ``NavigableString``::

 comment
 # u'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a ``Comment`` is
displayed with special formatting::

 print(soup.b.prettify())
 # <b>
 #  <!--Hey, buddy. Want to buy a used parser?-->
 # </b>

Beautiful Soup defines classes for anything else that might show up in
an XML document: ``CData``, ``ProcessingInstruction``,
``Declaration``, and ``Doctype``. Just like ``Comment``, these classes
are subclasses of ``NavigableString`` that add something extra to the
string. Here's an example that replaces the comment with a CDATA
block::

 from bs4 import CData
 cdata = CData("A CDATA block")
 comment.replace_with(cdata)

 print(soup.b.prettify())
 # <b>
 #  <![CDATA[A CDATA block]]>
 # </b>


Navigating the tree
===================

Here's the "Three sisters" HTML document again::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """

 from bs4 import BeautifulSoup
 soup = BeautifulSoup(html_doc)

I'll use this as an example to show you how to move from one part of
a document to another.

Going down
----------

Tags may contain strings and other tags. These elements are the tag's
`children`. Beautiful Soup provides a lot of different attributes for
navigating and iterating over a tag's children.

Note that Beautiful Soup strings don't support any of these
attributes, because a string can't have children.

Navigating using tag names
^^^^^^^^^^^^^^^^^^^^^^^^^^

The simplest way to navigate the parse tree is to say the name of the
tag you want. If you want the <head> tag, just say ``soup.head``::

 soup.head
 # <head><title>The Dormouse's story</title></head>

 soup.title
 # <title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part
of the parse tree. This code gets the first <b> tag beneath the <body> tag::

 soup.body.b
 # <b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the `first` tag by that
name::

 soup.a
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get `all` the <a> tags, or anything more complicated
than the first tag with a certain name, you'll need to use one of the
methods described in `Searching the tree`_, such as ``find_all()``::

 soup.find_all('a')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

``.contents`` and ``.children``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A tag's children are available in a list called ``.contents``::

 head_tag = soup.head
 head_tag
 # <head><title>The Dormouse's story</title></head>

 head_tag.contents
 # [<title>The Dormouse's story</title>]

 title_tag = head_tag.contents[0]
 title_tag
 # <title>The Dormouse's story</title>
 title_tag.contents
 # [u'The Dormouse's story']

The ``BeautifulSoup`` object itself has children. In this case, the
<html> tag is the child of the ``BeautifulSoup`` object::

 len(soup.contents)
 # 1
 soup.contents[0].name
 # u'html'

A string does not have ``.contents``, because it can't contain
anything::

 text = title_tag.contents[0]
 text.contents
 # AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, you can iterate over a tag's
children using the ``.children`` generator::

 for child in title_tag.children:
     print(child)
 # The Dormouse's story

``.descendants``
^^^^^^^^^^^^^^^^

The ``.contents`` and ``.children`` attributes only consider a tag's
`direct` children. For instance, the <head> tag has a single direct
child--the <title> tag::

 head_tag.contents
 # [<title>The Dormouse's story</title>]

But the <title> tag itself has a child: the string "The Dormouse's
story". There's a sense in which that string is also a child of the
<head> tag. The ``.descendants`` attribute lets you iterate over `all`
of a tag's children, recursively: its direct children, the children of
its direct children, and so on::

 for child in head_tag.descendants:
     print(child)
 # <title>The Dormouse's story</title>
 # The Dormouse's story

The <head> tag has only one child, but it has two descendants: the
<title> tag and the <title> tag's child. The ``BeautifulSoup`` object
only has one direct child (the <html> tag), but it has a whole lot of
descendants::

 len(list(soup.children))
 # 1
 len(list(soup.descendants))
 # 25

.. _.string:

``.string``
^^^^^^^^^^^

If a tag has only one child, and that child is a ``NavigableString``,
the child is made available as ``.string``::

 title_tag.string
 # u'The Dormouse's story'

If a tag's only child is another tag, and `that` tag has a
``.string``, then the parent tag is considered to have the same
``.string`` as its child::

 head_tag.contents
 # [<title>The Dormouse's story</title>]

 head_tag.string
 # u'The Dormouse's story'

If a tag contains more than one thing, then it's not clear what
``.string`` should refer to, so ``.string`` is defined to be
``None``::

 print(soup.html.string)
 # None

.. _string-generators:

``.strings`` and ``.stripped_strings``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If there's more than one thing inside a tag, you can still look at
just the strings. Use the ``.strings`` generator::

 for string in soup.strings:
     print(repr(string))
 # u"The Dormouse's story"
 # u'\n\n'
 # u"The Dormouse's story"
 # u'\n\n'
 # u'Once upon a time there were three little sisters; and their names were\n'
 # u'Elsie'
 # u',\n'
 # u'Lacie'
 # u' and\n'
 # u'Tillie'
 # u';\nand they lived at the bottom of a well.'
 # u'\n\n'
 # u'...'
 # u'\n'

These strings tend to have a lot of extra whitespace, which you can
remove by using the ``.stripped_strings`` generator instead::

 for string in soup.stripped_strings:
     print(repr(string))
 # u"The Dormouse's story"
 # u"The Dormouse's story"
 # u'Once upon a time there were three little sisters; and their names were'
 # u'Elsie'
 # u','
 # u'Lacie'
 # u'and'
 # u'Tillie'
 # u';\nand they lived at the bottom of a well.'
 # u'...'

Here, strings consisting entirely of whitespace are ignored, and
whitespace at the beginning and end of strings is removed.

Going up
--------

Continuing the "family tree" analogy, every tag and every string has a
`parent`: the tag that contains it.

.. _.parent:

``.parent``
^^^^^^^^^^^

You can access an element's parent with the ``.parent`` attribute. In
the example "three sisters" document, the <head> tag is the parent
of the <title> tag::

 title_tag = soup.title
 title_tag
 # <title>The Dormouse's story</title>
 title_tag.parent
 # <head><title>The Dormouse's story</title></head>

The title string itself has a parent: the <title> tag that contains
it::

 title_tag.string.parent
 # <title>The Dormouse's story</title>

The parent of a top-level tag like <html> is the ``BeautifulSoup`` object
itself::

 html_tag = soup.html
 type(html_tag.parent)
 # <class 'bs4.BeautifulSoup'>

And the ``.parent`` of a ``BeautifulSoup`` object is defined as None::

 print(soup.parent)
 # None

.. _.parents:

``.parents``
^^^^^^^^^^^^

You can iterate over all of an element's parents with
``.parents``. This example uses ``.parents`` to travel from an <a> tag
buried deep within the document, to the very top of the document::

 link = soup.a
 link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
 for parent in link.parents:
     if parent is None:
         print(parent)
     else:
         print(parent.name)
 # p
 # body
 # html
 # [document]
 # None

Going sideways
--------------

Consider a simple document like this::

 sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
 print(sibling_soup.prettify())
 # <html>
 #  <body>
 #   <a>
 #    <b>
 #     text1
 #    </b>
 #    <c>
 #     text2
 #    </c>
 #   </a>
 #  </body>
 # </html>

The <b> tag and the <c> tag are at the same level: they're both direct
children of the same tag. We call them `siblings`. When a document is
pretty-printed, siblings show up at the same indentation level. You
can also use this relationship in the code you write.

``.next_sibling`` and ``.previous_sibling``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can use ``.next_sibling`` and ``.previous_sibling`` to navigate
between page elements that are on the same level of the parse tree::

 sibling_soup.b.next_sibling
 # <c>text2</c>

 sibling_soup.c.previous_sibling
 # <b>text1</b>

The <b> tag has a ``.next_sibling``, but no ``.previous_sibling``,
because there's nothing before the <b> tag `on the same level of the
tree`. For the same reason, the <c> tag has a ``.previous_sibling``
but no ``.next_sibling``::

 print(sibling_soup.b.previous_sibling)
 # None
 print(sibling_soup.c.next_sibling)
 # None

The strings "text1" and "text2" are `not` siblings, because they don't
have the same parent::

 sibling_soup.b.string
 # u'text1'

 print(sibling_soup.b.string.next_sibling)
 # None

In real documents, the ``.next_sibling`` or ``.previous_sibling`` of a
tag will usually be a string containing whitespace. Going back to the
"three sisters" document::

 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

You might think that the ``.next_sibling`` of the first <a> tag would
be the second <a> tag. But actually, it's a string: the comma and
newline that separate the first <a> tag from the second::

 link = soup.a
 link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 link.next_sibling
 # u',\n'

The second <a> tag is actually the ``.next_sibling`` of the comma::

 link.next_sibling.next_sibling
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

.. _sibling-generators:

``.next_siblings`` and ``.previous_siblings``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can iterate over a tag's siblings with ``.next_siblings`` or
``.previous_siblings``::

 for sibling in soup.a.next_siblings:
     print(repr(sibling))
 # u',\n'
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 # u' and\n'
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
 # u'; and they lived at the bottom of a well.'
 # None

 for sibling in soup.find(id="link3").previous_siblings:
     print(repr(sibling))
 # u' and\n'
904 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
905 # u',\n'
906 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
907 # u'Once upon a time there were three little sisters; and their names were\n'
908 # None
909
910Going back and forth
911--------------------
912
913Take a look at the beginning of the "three sisters" document::
914
915 <html><head><title>The Dormouse's story</title></head>
916 <p class="title"><b>The Dormouse's story</b></p>
917
918An HTML parser takes this string of characters and turns it into a
919series of events: "open an <html> tag", "open a <head> tag", "open a
920<title> tag", "add a string", "close the <title> tag", "open a <p>
921tag", and so on. Beautiful Soup offers tools for reconstructing the
922initial parse of the document.
923
.. _element-generators:

``.next_element`` and ``.previous_element``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``.next_element`` attribute of a string or tag points to whatever
was parsed immediately afterwards. It might be the same as
``.next_sibling``, but it's usually drastically different.

Here's the final <a> tag in the "three sisters" document. Its
``.next_sibling`` is a string: the conclusion of the sentence that was
interrupted by the start of the <a> tag::

 last_a_tag = soup.find("a", id="link3")
 last_a_tag
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 last_a_tag.next_sibling
 # u'; and they lived at the bottom of a well.'

But the ``.next_element`` of that <a> tag, the thing that was parsed
immediately after the <a> tag, is `not` the rest of that sentence:
it's the word "Tillie"::

 last_a_tag.next_element
 # u'Tillie'

That's because in the original markup, the word "Tillie" appeared
before that semicolon. The parser encountered an <a> tag, then the
word "Tillie", then the closing </a> tag, then the semicolon and rest of
the sentence. The semicolon is on the same level as the <a> tag, but the
word "Tillie" was encountered first.

The ``.previous_element`` attribute is the exact opposite of
``.next_element``. It points to whatever element was parsed
immediately before this one::

 last_a_tag.previous_element
 # u' and\n'
 last_a_tag.previous_element.next_element
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

``.next_elements`` and ``.previous_elements``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You should get the idea by now. You can use these iterators to move
forward or backward in the document as it was parsed::

 for element in last_a_tag.next_elements:
     print(repr(element))
 # u'Tillie'
 # u';\nand they lived at the bottom of a well.'
 # u'\n\n'
 # <p class="story">...</p>
 # u'...'
 # u'\n'
 # None

Searching the tree
==================

Beautiful Soup defines a lot of methods for searching the parse tree,
but they're all very similar. I'm going to spend a lot of time explaining
the two most popular methods: ``find()`` and ``find_all()``. The other
methods take almost exactly the same arguments, so I'll just cover
them briefly.

Once again, I'll be using the "three sisters" document as an example::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """

 from bs4 import BeautifulSoup
 soup = BeautifulSoup(html_doc)

By passing in a filter to a method like ``find_all()``, you can
zoom in on the parts of the document you're interested in.

Kinds of filters
----------------

Before talking in detail about ``find_all()`` and similar methods, I
want to show examples of different filters you can pass into these
methods. These filters show up again and again, throughout the
search API. You can use them to filter based on a tag's name,
on its attributes, on the text of a string, or on some combination of
these.

.. _a string:

A string
^^^^^^^^

The simplest filter is a string. Pass a string to a search method and
Beautiful Soup will perform a match against that exact string. This
code finds all the <b> tags in the document::

 soup.find_all('b')
 # [<b>The Dormouse's story</b>]

If you pass in a byte string, Beautiful Soup will assume the string is
encoded as UTF-8. You can avoid this by passing in a Unicode string instead.

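The distinction is easy to see with the standard library alone (a
hypothetical string; Beautiful Soup is not involved here):

```python
# The bytes \xc3\xa9 are the UTF-8 encoding of the single character "é".
byte_string = b"caf\xc3\xa9"
unicode_string = byte_string.decode("utf-8")

print(unicode_string)
# café
print(len(byte_string), len(unicode_string))
# 5 4
```

If your byte string is in some other encoding, decode it yourself
before handing it to a search method.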
.. _a regular expression:

A regular expression
^^^^^^^^^^^^^^^^^^^^

If you pass in a regular expression object, Beautiful Soup will filter
against that regular expression using its ``search()`` method. This code
finds all the tags whose names start with the letter "b"; in this
case, the <body> tag and the <b> tag::

 import re
 for tag in soup.find_all(re.compile("^b")):
     print(tag.name)
 # body
 # b

This code finds all the tags whose names contain the letter 't'::

 for tag in soup.find_all(re.compile("t")):
     print(tag.name)
 # html
 # title

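Because the filtering behaves like ``re.search()``, an unanchored
pattern can match anywhere in the tag name, while a pattern anchored
with ``^`` only matches at the start. A plain ``re`` demonstration,
independent of Beautiful Soup:

```python
import re

contains_t = re.compile("t")      # unanchored: matches anywhere in the name
starts_with_b = re.compile("^b")  # anchored to the start of the name

print(bool(contains_t.search("html")))     # True: "t" appears inside "html"
print(bool(contains_t.search("body")))     # False: no "t" in "body"
print(bool(starts_with_b.search("html")))  # False
print(bool(starts_with_b.search("body")))  # True
```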
.. _a list:

A list
^^^^^^

If you pass in a list, Beautiful Soup will allow a string match
against `any` item in that list. This code finds all the <a> tags
`and` all the <b> tags::

 soup.find_all(["a", "b"])
 # [<b>The Dormouse's story</b>,
 #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.. _the value True:

``True``
^^^^^^^^

The value ``True`` matches everything it can. This code finds `all`
the tags in the document, but none of the text strings::

 for tag in soup.find_all(True):
     print(tag.name)
 # html
 # head
 # title
 # body
 # p
 # b
 # p
 # a
 # a
 # a
 # p

.. _a function:

A function
^^^^^^^^^^

If none of the other matches work for you, define a function that
takes an element as its only argument. The function should return
``True`` if the argument matches, and ``False`` otherwise.

Here's a function that returns ``True`` if a tag defines the "class"
attribute but doesn't define the "id" attribute::

 def has_class_but_no_id(tag):
     return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into ``find_all()`` and you'll pick up all the <p>
tags::

 soup.find_all(has_class_but_no_id)
 # [<p class="title"><b>The Dormouse's story</b></p>,
 #  <p class="story">Once upon a time there were...</p>,
 #  <p class="story">...</p>]

This function only picks up the <p> tags. It doesn't pick up the <a>
tags, because those tags define both "class" and "id". It doesn't pick
up tags like <html> and <title>, because those tags don't define
"class".

Here's a function that returns ``True`` if a tag is surrounded by
string objects::

 from bs4 import NavigableString
 def surrounded_by_strings(tag):
     return (isinstance(tag.next_element, NavigableString)
             and isinstance(tag.previous_element, NavigableString))

 for tag in soup.find_all(surrounded_by_strings):
     print(tag.name)
 # p
 # a
 # a
 # a
 # p

Now we're ready to look at the search methods in detail.

``find_all()``
--------------

Signature: find_all(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
<recursive>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

The ``find_all()`` method looks through a tag's descendants and
retrieves `all` descendants that match your filters. I gave several
examples in `Kinds of filters`_, but here are a few more::

 soup.find_all("title")
 # [<title>The Dormouse's story</title>]

 soup.find_all("p", "title")
 # [<p class="title"><b>The Dormouse's story</b></p>]

 soup.find_all("a")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.find_all(id="link2")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

 import re
 soup.find(text=re.compile("sisters"))
 # u'Once upon a time there were three little sisters; and their names were\n'

Some of these should look familiar, but others are new. What does it
mean to pass in a value for ``text``, or ``id``? Why does
``find_all("p", "title")`` find a <p> tag with the CSS class "title"?
Let's look at the arguments to ``find_all()``.

.. _name:

The ``name`` argument
^^^^^^^^^^^^^^^^^^^^^

Pass in a value for ``name`` and you'll tell Beautiful Soup to only
consider tags with certain names. Text strings will be ignored, as
will tags whose names don't match.

This is the simplest usage::

 soup.find_all("title")
 # [<title>The Dormouse's story</title>]

Recall from `Kinds of filters`_ that the value to ``name`` can be `a
string`_, `a regular expression`_, `a list`_, `a function`_, or `the value
True`_.

.. _kwargs:

The keyword arguments
^^^^^^^^^^^^^^^^^^^^^

Any argument that's not recognized will be turned into a filter on one
of a tag's attributes. If you pass in a value for an argument called ``id``,
Beautiful Soup will filter against each tag's 'id' attribute::

 soup.find_all(id='link2')
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for ``href``, Beautiful Soup will filter
against each tag's 'href' attribute::

 soup.find_all(href=re.compile("elsie"))
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

You can filter an attribute based on `a string`_, `a regular
expression`_, `a list`_, `a function`_, or `the value True`_.

This code finds all tags whose ``id`` attribute has a value,
regardless of what the value is::

 soup.find_all(id=True)
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can filter multiple attributes at once by passing in more than one
keyword argument::

 soup.find_all(href=re.compile("elsie"), id='link1')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attributes, like the data-* attributes in HTML 5, have names that
can't be used as the names of keyword arguments::

 data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
 data_soup.find_all(data-foo="value")
 # SyntaxError: keyword can't be an expression

You can use these attributes in searches by putting them into a
dictionary and passing the dictionary into ``find_all()`` as the
``attrs`` argument::

 data_soup.find_all(attrs={"data-foo": "value"})
 # [<div data-foo="value">foo!</div>]

.. _attrs:

Searching by CSS class
^^^^^^^^^^^^^^^^^^^^^^

It's very useful to search for a tag that has a certain CSS class, but
the name of the CSS attribute, "class", is a reserved word in
Python. Using ``class`` as a keyword argument will give you a syntax
error. As of Beautiful Soup 4.1.2, you can search by CSS class using
the keyword argument ``class_``::

 soup.find_all("a", class_="sister")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As with any keyword argument, you can pass ``class_`` a string, a regular
expression, a function, or ``True``::

 soup.find_all(class_=re.compile("itl"))
 # [<p class="title"><b>The Dormouse's story</b></p>]

 def has_six_characters(css_class):
     return css_class is not None and len(css_class) == 6

 soup.find_all(class_=has_six_characters)
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

:ref:`Remember <multivalue>` that a single tag can have multiple
values for its "class" attribute. When you search for a tag that
matches a certain CSS class, you're matching against `any` of its CSS
classes::

 css_soup = BeautifulSoup('<p class="body strikeout"></p>')
 css_soup.find_all("p", class_="strikeout")
 # [<p class="body strikeout"></p>]

 css_soup.find_all("p", class_="body")
 # [<p class="body strikeout"></p>]

You can also search for the exact string value of the ``class`` attribute::

 css_soup.find_all("p", class_="body strikeout")
 # [<p class="body strikeout"></p>]

But searching for variants of the string value won't work::

 css_soup.find_all("p", class_="strikeout body")
 # []

If you want to search for tags that match two or more CSS classes, you
should use a CSS selector::

 css_soup.select("p.strikeout.body")
 # [<p class="body strikeout"></p>]

In older versions of Beautiful Soup, which don't have the ``class_``
shortcut, you can use the ``attrs`` trick mentioned above. Create a
dictionary whose value for "class" is the string (or regular
expression, or whatever) you want to search for::

 soup.find_all("a", attrs={"class": "sister"})
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.. _text:

The ``text`` argument
^^^^^^^^^^^^^^^^^^^^^

With ``text`` you can search for strings instead of tags. As with
``name`` and the keyword arguments, you can pass in `a string`_, `a
regular expression`_, `a list`_, `a function`_, or `the value True`_.
Here are some examples::

 soup.find_all(text="Elsie")
 # [u'Elsie']

 soup.find_all(text=["Tillie", "Elsie", "Lacie"])
 # [u'Elsie', u'Lacie', u'Tillie']

 soup.find_all(text=re.compile("Dormouse"))
 # [u"The Dormouse's story", u"The Dormouse's story"]

 def is_the_only_string_within_a_tag(s):
     """Return True if this string is the only child of its parent tag."""
     return (s == s.parent.string)

 soup.find_all(text=is_the_only_string_within_a_tag)
 # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']

Although ``text`` is for finding strings, you can combine it with
arguments that find tags: Beautiful Soup will find all tags whose
``.string`` matches your value for ``text``. This code finds the <a>
tags whose ``.string`` is "Elsie"::

 soup.find_all("a", text="Elsie")
 # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

.. _limit:

The ``limit`` argument
^^^^^^^^^^^^^^^^^^^^^^

``find_all()`` returns all the tags and strings that match your
filters. This can take a while if the document is large. If you don't
need `all` the results, you can pass in a number for ``limit``. This
works just like the LIMIT keyword in SQL. It tells Beautiful Soup to
stop gathering results after it's found a certain number.

There are three links in the "three sisters" document, but this code
only finds the first two::

 soup.find_all("a", limit=2)
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

.. _recursive:

The ``recursive`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^^

If you call ``mytag.find_all()``, Beautiful Soup will examine all the
descendants of ``mytag``: its children, its children's children, and
so on. If you only want Beautiful Soup to consider direct children,
you can pass in ``recursive=False``. See the difference here::

 soup.html.find_all("title")
 # [<title>The Dormouse's story</title>]

 soup.html.find_all("title", recursive=False)
 # []

Here's that part of the document::

 <html>
  <head>
   <title>
    The Dormouse's story
   </title>
  </head>
 ...

The <title> tag is beneath the <html> tag, but it's not `directly`
beneath the <html> tag: the <head> tag is in the way. Beautiful Soup
finds the <title> tag when it's allowed to look at all descendants of
the <html> tag, but when ``recursive=False`` restricts it to the
<html> tag's immediate children, it finds nothing.

Beautiful Soup offers a lot of tree-searching methods (covered below),
and they mostly take the same arguments as ``find_all()``: ``name``,
``attrs``, ``text``, ``limit``, and the keyword arguments. But the
``recursive`` argument is different: ``find_all()`` and ``find()`` are
the only methods that support it. Passing ``recursive=False`` into a
method like ``find_parents()`` wouldn't be very useful.

Calling a tag is like calling ``find_all()``
--------------------------------------------

Because ``find_all()`` is the most popular method in the Beautiful
Soup search API, you can use a shortcut for it. If you treat the
``BeautifulSoup`` object or a ``Tag`` object as though it were a
function, then it's the same as calling ``find_all()`` on that
object. These two lines of code are equivalent::

 soup.find_all("a")
 soup("a")

These two lines are also equivalent::

 soup.title.find_all(text=True)
 soup.title(text=True)

``find()``
----------

Signature: find(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
<recursive>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

The ``find_all()`` method scans the entire document looking for
results, but sometimes you only want to find one result. If you know a
document only has one <body> tag, it's a waste of time to scan the
entire document looking for more. Rather than passing in ``limit=1``
every time you call ``find_all``, you can use the ``find()``
method. These two lines of code are `nearly` equivalent::

 soup.find_all('title', limit=1)
 # [<title>The Dormouse's story</title>]

 soup.find('title')
 # <title>The Dormouse's story</title>

The only difference is that ``find_all()`` returns a list containing
the single result, and ``find()`` just returns the result.

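Conceptually, ``find()`` behaves like a ``find_all()`` call with
``limit=1`` that unwraps the result. Here's a sketch of that
relationship in plain Python (``find_first`` is a hypothetical helper,
not Beautiful Soup's actual implementation):

```python
def find_first(results):
    """Unwrap a find_all()-style result list the way find() does:
    return the first item, or None when the list is empty."""
    if results:
        return results[0]
    return None

print(find_first(["<title>The Dormouse's story</title>"]))
# <title>The Dormouse's story</title>
print(find_first([]))
# None
```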
If ``find_all()`` can't find anything, it returns an empty list. If
``find()`` can't find anything, it returns ``None``::

 print(soup.find("nosuchtag"))
 # None

Remember the ``soup.head.title`` trick from `Navigating using tag
names`_? That trick works by repeatedly calling ``find()``::

 soup.head.title
 # <title>The Dormouse's story</title>

 soup.find("head").find("title")
 # <title>The Dormouse's story</title>

``find_parents()`` and ``find_parent()``
----------------------------------------

Signature: find_parents(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_parent(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

I spent a lot of time above covering ``find_all()`` and
``find()``. The Beautiful Soup API defines ten other methods for
searching the tree, but don't be afraid. Five of these methods are
basically the same as ``find_all()``, and the other five are basically
the same as ``find()``. The only differences are in what parts of the
tree they search.

First let's consider ``find_parents()`` and
``find_parent()``. Remember that ``find_all()`` and ``find()`` work
their way down the tree, looking at a tag's descendants. These methods
do the opposite: they work their way `up` the tree, looking at a tag's
(or a string's) parents. Let's try them out, starting from a string
buried deep in the "three sisters" document::

  a_string = soup.find(text="Lacie")
  a_string
  # u'Lacie'

  a_string.find_parents("a")
  # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

  a_string.find_parent("p")
  # <p class="story">Once upon a time there were three little sisters; and their names were
  #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
  #  and they lived at the bottom of a well.</p>

  a_string.find_parents("p", class_="title")
  # []

One of the three <a> tags is the direct parent of the string in
question, so our search finds it. One of the three <p> tags is an
indirect parent of the string, and our search finds that as
well. There's a <p> tag with the CSS class "title" `somewhere` in the
document, but it's not one of this string's parents, so we can't find
it with ``find_parents()``.

You may have made the connection between ``find_parent()`` and
``find_parents()``, and the `.parent`_ and `.parents`_ attributes
mentioned earlier. The connection is very strong. These search methods
actually use ``.parents`` to iterate over all the parents, and check
each one against the provided filter to see if it matches.

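The walk-and-filter idea can be sketched with a toy tree. The ``Node``
class and ``find_parent`` function below are illustrations only, not
Beautiful Soup's actual code:

```python
class Node:
    """A toy tree element with just a name and a parent link."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    @property
    def parents(self):
        # Walk upward, like Beautiful Soup's .parents generator.
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

def find_parent(element, name):
    # Check each ancestor against the filter; the first match wins.
    for candidate in element.parents:
        if candidate.name == name:
            return candidate
    return None

html = Node("html")
p = Node("p", parent=html)
a = Node("a", parent=p)
print(find_parent(a, "html").name)
# html
print(find_parent(a, "table"))
# None
```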
``find_next_siblings()`` and ``find_next_sibling()``
----------------------------------------------------

Signature: find_next_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_next_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.next_siblings <sibling-generators>` to
iterate over the rest of an element's siblings in the tree. The
``find_next_siblings()`` method returns all the siblings that match,
and ``find_next_sibling()`` only returns the first one::

 first_link = soup.a
 first_link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 first_link.find_next_siblings("a")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 first_story_paragraph = soup.find("p", "story")
 first_story_paragraph.find_next_sibling("p")
 # <p class="story">...</p>

``find_previous_siblings()`` and ``find_previous_sibling()``
------------------------------------------------------------

Signature: find_previous_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_previous_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.previous_siblings <sibling-generators>` to iterate over an element's
siblings that precede it in the tree. The ``find_previous_siblings()``
method returns all the siblings that match, and
``find_previous_sibling()`` only returns the first one::

 last_link = soup.find("a", id="link3")
 last_link
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 last_link.find_previous_siblings("a")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 first_story_paragraph = soup.find("p", "story")
 first_story_paragraph.find_previous_sibling("p")
 # <p class="title"><b>The Dormouse's story</b></p>


``find_all_next()`` and ``find_next()``
---------------------------------------

Signature: find_all_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.next_elements <element-generators>` to
iterate over the tags and strings that come after an element in the
document. The ``find_all_next()`` method returns all matches, and
``find_next()`` only returns the first match::

 first_link = soup.a
 first_link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 first_link.find_all_next(text=True)
 # [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
 #  u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']

 first_link.find_next("p")
 # <p class="story">...</p>

In the first example, the string "Elsie" showed up, even though it was
contained within the <a> tag we started from. In the second example,
the last <p> tag in the document showed up, even though it's not in
the same part of the tree as the <a> tag we started from. For these
methods, all that matters is that an element match the filter, and
show up later in the document than the starting element.

``find_all_previous()`` and ``find_previous()``
-----------------------------------------------

Signature: find_all_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.previous_elements <element-generators>` to
iterate over the tags and strings that came before an element in the
document. The ``find_all_previous()`` method returns all matches, and
``find_previous()`` only returns the first match::

 first_link = soup.a
 first_link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 first_link.find_all_previous("p")
 # [<p class="story">Once upon a time there were three little sisters; ...</p>,
 #  <p class="title"><b>The Dormouse's story</b></p>]

 first_link.find_previous("title")
 # <title>The Dormouse's story</title>

The call to ``find_all_previous("p")`` found the first paragraph in
the document (the one with class="title"), but it also found the
second paragraph, the <p> tag that contains the <a> tag we started
with. This shouldn't be too surprising: we're looking at all the tags
that show up earlier in the document than the one we started with. A
<p> tag that contains an <a> tag must have shown up before the <a>
tag it contains.

CSS selectors
-------------

Beautiful Soup supports the most commonly-used `CSS selectors
<http://www.w3.org/TR/CSS2/selector.html>`_. Just pass a string into
the ``.select()`` method of a ``Tag`` object or the ``BeautifulSoup``
object itself.

You can find tags::

 soup.select("title")
 # [<title>The Dormouse's story</title>]

 soup.select("p:nth-of-type(3)")
 # [<p class="story">...</p>]

Find tags beneath other tags::

 soup.select("body a")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select("html head title")
 # [<title>The Dormouse's story</title>]

Find tags `directly` beneath other tags::

 soup.select("head > title")
 # [<title>The Dormouse's story</title>]

 soup.select("p > a")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select("p > a:nth-of-type(2)")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

 soup.select("p > #link1")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 soup.select("body > a")
 # []

Find the siblings of tags::

 soup.select("#link1 ~ .sister")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select("#link1 + .sister")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class::

 soup.select(".sister")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select("[class~=sister]")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by ID::

 soup.select("#link1")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 soup.select("a#link2")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Test for the existence of an attribute::

 soup.select('a[href]')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value::

 soup.select('a[href="http://example.com/elsie"]')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 soup.select('a[href^="http://example.com/"]')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select('a[href$="tillie"]')
 # [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select('a[href*=".com/el"]')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Match language codes::

 multilingual_markup = """
  <p lang="en">Hello</p>
  <p lang="en-us">Howdy, y'all</p>
  <p lang="en-gb">Pip-pip, old fruit</p>
  <p lang="fr">Bonjour mes amis</p>
 """
 multilingual_soup = BeautifulSoup(multilingual_markup)
 multilingual_soup.select('p[lang|=en]')
 # [<p lang="en">Hello</p>,
 #  <p lang="en-us">Howdy, y'all</p>,
 #  <p lang="en-gb">Pip-pip, old fruit</p>]

This is a convenience for users who know the CSS selector syntax. You
can do all this stuff with the Beautiful Soup API. And if CSS
selectors are all you need, you might as well use lxml directly,
because it's faster. But this lets you `combine` simple CSS selectors
with the Beautiful Soup API.


Modifying the tree
==================

Beautiful Soup's main strength is in searching the parse tree, but you
can also modify the tree and write your changes as a new HTML or XML
document.

Changing tag names and attributes
---------------------------------

I covered this earlier, in `Attributes`_, but it bears repeating. You
can rename a tag, change the values of its attributes, add new
attributes, and delete attributes::

 soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
 tag = soup.b

 tag.name = "blockquote"
 tag['class'] = 'verybold'
 tag['id'] = 1
 tag
 # <blockquote class="verybold" id="1">Extremely bold</blockquote>

 del tag['class']
 del tag['id']
 tag
 # <blockquote>Extremely bold</blockquote>


Modifying ``.string``
---------------------

If you set a tag's ``.string`` attribute, the tag's contents are
replaced with the string you give::

  markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
  soup = BeautifulSoup(markup)

  tag = soup.a
  tag.string = "New link text."
  tag
  # <a href="http://example.com/">New link text.</a>

1779Be careful: if the tag contained other tags, they and all their
1780contents will be destroyed.
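Here's a sketch of that hazard (I pass ``html.parser`` explicitly just to pin down a parser):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

soup.a.string = "New link text."
# The <i> tag that used to live inside <a> is gone for good:
print(soup.a.i)
# None
```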
1781
1782``append()``
1783------------
1784
1785You can add to a tag's contents with ``Tag.append()``. It works just
1786like calling ``.append()`` on a Python list::
1787
1788   soup = BeautifulSoup("<a>Foo</a>")
1789   soup.a.append("Bar")
1790
1791   soup
1792   # <html><head></head><body><a>FooBar</a></body></html>
1793   soup.a.contents
1794   # [u'Foo', u'Bar']
1795
1796``BeautifulSoup.new_string()`` and ``.new_tag()``
1797-------------------------------------------------
1798
1799If you need to add a string to a document, no problem--you can pass a
1800Python string in to ``append()``, or you can call the factory method
1801``BeautifulSoup.new_string()``::
1802
1803   soup = BeautifulSoup("<b></b>")
1804   tag = soup.b
1805   tag.append("Hello")
1806   new_string = soup.new_string(" there")
1807   tag.append(new_string)
1808   tag
   # <b>Hello there</b>
1810   tag.contents
1811   # [u'Hello', u' there']
1812
1813If you want to create a comment or some other subclass of
1814``NavigableString``, pass that class as the second argument to
1815``new_string()``::
1816
1817   from bs4 import Comment
1818   new_comment = soup.new_string("Nice to see you.", Comment)
1819   tag.append(new_comment)
1820   tag
1821   # <b>Hello there<!--Nice to see you.--></b>
1822   tag.contents
1823   # [u'Hello', u' there', u'Nice to see you.']
1824
1825(This is a new feature in Beautiful Soup 4.2.1.)
1826
1827What if you need to create a whole new tag?  The best solution is to
1828call the factory method ``BeautifulSoup.new_tag()``::
1829
1830   soup = BeautifulSoup("<b></b>")
1831   original_tag = soup.b
1832
1833   new_tag = soup.new_tag("a", href="http://www.example.com")
1834   original_tag.append(new_tag)
1835   original_tag
1836   # <b><a href="http://www.example.com"></a></b>
1837
1838   new_tag.string = "Link text."
1839   original_tag
1840   # <b><a href="http://www.example.com">Link text.</a></b>
1841
1842Only the first argument, the tag name, is required.
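One wrinkle: ``class`` is a reserved word in Python, so you can't pass it as a keyword argument to ``new_tag()``. Here's a sketch of the usual workaround, unpacking a dictionary of attributes:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", "html.parser")

# "class" can't be a keyword argument directly, but unpacking a
# dictionary of attributes works around the reserved word.
new_tag = soup.new_tag("p", **{"class": "story"})
soup.b.append(new_tag)
soup.b
# <b><p class="story"></p></b>
```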
1843
1844``insert()``
1845------------
1846
1847``Tag.insert()`` is just like ``Tag.append()``, except the new element
1848doesn't necessarily go at the end of its parent's
1849``.contents``. It'll be inserted at whatever numeric position you
1850say. It works just like ``.insert()`` on a Python list::
1851
1852  markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
1853  soup = BeautifulSoup(markup)
1854  tag = soup.a
1855
1856  tag.insert(1, "but did not endorse ")
1857  tag
1858  # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
1859  tag.contents
  # [u'I linked to ', u'but did not endorse ', <i>example.com</i>]
1861
1862``insert_before()`` and ``insert_after()``
1863------------------------------------------
1864
1865The ``insert_before()`` method inserts a tag or string immediately
1866before something else in the parse tree::
1867
1868   soup = BeautifulSoup("<b>stop</b>")
1869   tag = soup.new_tag("i")
1870   tag.string = "Don't"
1871   soup.b.string.insert_before(tag)
1872   soup.b
1873   # <b><i>Don't</i>stop</b>
1874
1875The ``insert_after()`` method moves a tag or string so that it
1876immediately follows something else in the parse tree::
1877
1878   soup.b.i.insert_after(soup.new_string(" ever "))
1879   soup.b
1880   # <b><i>Don't</i> ever stop</b>
1881   soup.b.contents
1882   # [<i>Don't</i>, u' ever ', u'stop']
1883
1884``clear()``
1885-----------
1886
1887``Tag.clear()`` removes the contents of a tag::
1888
1889  markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
1890  soup = BeautifulSoup(markup)
1891  tag = soup.a
1892
1893  tag.clear()
1894  tag
1895  # <a href="http://example.com/"></a>
1896
1897``extract()``
1898-------------
1899
1900``PageElement.extract()`` removes a tag or string from the tree. It
1901returns the tag or string that was extracted::
1902
1903  markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
1904  soup = BeautifulSoup(markup)
1905  a_tag = soup.a
1906
1907  i_tag = soup.i.extract()
1908
1909  a_tag
1910  # <a href="http://example.com/">I linked to</a>
1911
1912  i_tag
1913  # <i>example.com</i>
1914
1915  print(i_tag.parent)
1916  None
1917
1918At this point you effectively have two parse trees: one rooted at the
1919``BeautifulSoup`` object you used to parse the document, and one rooted
1920at the tag that was extracted. You can go on to call ``extract`` on
1921a child of the element you extracted::
1922
1923  my_string = i_tag.string.extract()
1924  my_string
1925  # u'example.com'
1926
1927  print(my_string.parent)
1928  # None
1929  i_tag
1930  # <i></i>
1931
1932
1933``decompose()``
1934---------------
1935
1936``Tag.decompose()`` removes a tag from the tree, then `completely
1937destroys it and its contents`::
1938
1939  markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
1940  soup = BeautifulSoup(markup)
1941  a_tag = soup.a
1942
1943  soup.i.decompose()
1944
1945  a_tag
1946  # <a href="http://example.com/">I linked to</a>
1947
1948
1949.. _replace_with:
1950
1951``replace_with()``
1952------------------
1953
1954``PageElement.replace_with()`` removes a tag or string from the tree,
1955and replaces it with the tag or string of your choice::
1956
1957  markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
1958  soup = BeautifulSoup(markup)
1959  a_tag = soup.a
1960
1961  new_tag = soup.new_tag("b")
1962  new_tag.string = "example.net"
1963  a_tag.i.replace_with(new_tag)
1964
1965  a_tag
1966  # <a href="http://example.com/">I linked to <b>example.net</b></a>
1967
1968``replace_with()`` returns the tag or string that was replaced, so
1969that you can examine it or add it back to another part of the tree.
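Here's a sketch of holding on to that return value (``html.parser`` is passed explicitly just to pin down a parser):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

new_tag = soup.new_tag("b")
new_tag.string = "example.net"
old_tag = soup.a.i.replace_with(new_tag)

old_tag          # the replaced <i> tag, detached from the tree
# <i>example.com</i>
print(old_tag.parent)
# None
```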
1970
1971``wrap()``
1972----------
1973
1974``PageElement.wrap()`` wraps an element in the tag you specify. It
1975returns the new wrapper::
1976
1977 soup = BeautifulSoup("<p>I wish I was bold.</p>")
1978 soup.p.string.wrap(soup.new_tag("b"))
1979 # <b>I wish I was bold.</b>
1980
 soup.p.wrap(soup.new_tag("div"))
1982 # <div><p><b>I wish I was bold.</b></p></div>
1983
1984This method is new in Beautiful Soup 4.0.5.
1985
1986``unwrap()``
1987---------------------------
1988
1989``Tag.unwrap()`` is the opposite of ``wrap()``. It replaces a tag with
1990whatever's inside that tag. It's good for stripping out markup::
1991
1992  markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
1993  soup = BeautifulSoup(markup)
1994  a_tag = soup.a
1995
1996  a_tag.i.unwrap()
1997  a_tag
1998  # <a href="http://example.com/">I linked to example.com</a>
1999
2000Like ``replace_with()``, ``unwrap()`` returns the tag
2001that was replaced.
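Here's a sketch showing the return value (``html.parser`` is passed explicitly just to pin down a parser):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

old_i = soup.a.i.unwrap()
old_i            # the now-empty <i> tag
# <i></i>
soup.a
# <a href="http://example.com/">I linked to example.com</a>
```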
2002
2003Output
2004======
2005
2006.. _.prettyprinting:
2007
2008Pretty-printing
2009---------------
2010
2011The ``prettify()`` method will turn a Beautiful Soup parse tree into a
2012nicely formatted Unicode string, with each HTML/XML tag on its own line::
2013
2014  markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
2015  soup = BeautifulSoup(markup)
2016  soup.prettify()
2017  # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
2018
2019  print(soup.prettify())
2020  # <html>
2021  #  <head>
2022  #  </head>
2023  #  <body>
2024  #   <a href="http://example.com/">
2025  #    I linked to
2026  #    <i>
2027  #     example.com
2028  #    </i>
2029  #   </a>
2030  #  </body>
2031  # </html>
2032
2033You can call ``prettify()`` on the top-level ``BeautifulSoup`` object,
2034or on any of its ``Tag`` objects::
2035
2036  print(soup.a.prettify())
2037  # <a href="http://example.com/">
2038  #  I linked to
2039  #  <i>
2040  #   example.com
2041  #  </i>
2042  # </a>
2043
2044Non-pretty printing
2045-------------------
2046
2047If you just want a string, with no fancy formatting, you can call
2048``unicode()`` or ``str()`` on a ``BeautifulSoup`` object, or a ``Tag``
2049within it::
2050
2051 str(soup)
2052 # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
2053
2054 unicode(soup.a)
2055 # u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
2056
In Python 2, the ``str()`` function returns a bytestring encoded in
UTF-8; in Python 3 it returns a Unicode string. See `Encodings`_ for
other options.
2059
2060You can also call ``encode()`` to get a bytestring, and ``decode()``
2061to get Unicode.
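Here's a brief sketch of both methods:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Sacré bleu!</b>", "html.parser")

soup.b.encode("utf-8")    # a bytestring
# '<b>Sacr\xc3\xa9 bleu!</b>'
soup.b.decode()           # a Unicode string
# u'<b>Sacré bleu!</b>'
```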
2062
2063.. _output_formatters:
2064
2065Output formatters
2066-----------------
2067
If you give Beautiful Soup a document that contains HTML entities like
"&ldquo;", they'll be converted to Unicode characters::
2070
2071 soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
2072 unicode(soup)
2073 # u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'
2074
2075If you then convert the document to a string, the Unicode characters
2076will be encoded as UTF-8. You won't get the HTML entities back::
2077
2078 str(soup)
2079 # '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'
2080
2081By default, the only characters that are escaped upon output are bare
2082ampersands and angle brackets. These get turned into "&amp;", "&lt;",
2083and "&gt;", so that Beautiful Soup doesn't inadvertently generate
2084invalid HTML or XML::
2085
2086 soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
2087 soup.p
2088 # <p>The law firm of Dewey, Cheatem, &amp; Howe</p>
2089
2090 soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
2091 soup.a
2092 # <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>
2093
2094You can change this behavior by providing a value for the
2095``formatter`` argument to ``prettify()``, ``encode()``, or
2096``decode()``. Beautiful Soup recognizes four possible values for
2097``formatter``.
2098
2099The default is ``formatter="minimal"``. Strings will only be processed
2100enough to ensure that Beautiful Soup generates valid HTML/XML::
2101
2102 french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
2103 soup = BeautifulSoup(french)
2104 print(soup.prettify(formatter="minimal"))
2105 # <html>
2106 #  <body>
2107 #   <p>
2108 #    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
2109 #   </p>
2110 #  </body>
2111 # </html>
2112
2113If you pass in ``formatter="html"``, Beautiful Soup will convert
2114Unicode characters to HTML entities whenever possible::
2115
2116 print(soup.prettify(formatter="html"))
2117 # <html>
2118 #  <body>
2119 #   <p>
2120 #    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
2121 #   </p>
2122 #  </body>
2123 # </html>
2124
2125If you pass in ``formatter=None``, Beautiful Soup will not modify
2126strings at all on output. This is the fastest option, but it may lead
2127to Beautiful Soup generating invalid HTML/XML, as in these examples::
2128
2129 print(soup.prettify(formatter=None))
2130 # <html>
2131 #  <body>
2132 #   <p>
2133 #    Il a dit <<Sacré bleu!>>
2134 #   </p>
2135 #  </body>
2136 # </html>
2137
2138 link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
2139 print(link_soup.a.encode(formatter=None))
2140 # <a href="http://example.com/?foo=val1&bar=val2">A link</a>
2141
2142Finally, if you pass in a function for ``formatter``, Beautiful Soup
2143will call that function once for every string and attribute value in
2144the document. You can do whatever you want in this function. Here's a
2145formatter that converts strings to uppercase and does absolutely
2146nothing else::
2147
2148 def uppercase(str):
2149     return str.upper()
2150
2151 print(soup.prettify(formatter=uppercase))
2152 # <html>
2153 #  <body>
2154 #   <p>
2155 #    IL A DIT <<SACRÉ BLEU!>>
2156 #   </p>
2157 #  </body>
2158 # </html>
2159
2160 print(link_soup.a.prettify(formatter=uppercase))
2161 # <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
2162 #  A LINK
2163 # </a>
2164
2165If you're writing your own function, you should know about the
2166``EntitySubstitution`` class in the ``bs4.dammit`` module. This class
2167implements Beautiful Soup's standard formatters as class methods: the
2168"html" formatter is ``EntitySubstitution.substitute_html``, and the
2169"minimal" formatter is ``EntitySubstitution.substitute_xml``. You can
use these functions to simulate ``formatter="html"`` or
``formatter="minimal"``, but then do something extra.
2172
2173Here's an example that replaces Unicode characters with HTML entities
2174whenever possible, but `also` converts all strings to uppercase::
2175
2176 from bs4.dammit import EntitySubstitution
2177 def uppercase_and_substitute_html_entities(str):
2178     return EntitySubstitution.substitute_html(str.upper())
2179
2180 print(soup.prettify(formatter=uppercase_and_substitute_html_entities))
2181 # <html>
2182 #  <body>
2183 #   <p>
2184 #    IL A DIT &lt;&lt;SACR&Eacute; BLEU!&gt;&gt;
2185 #   </p>
2186 #  </body>
2187 # </html>
2188
2189One last caveat: if you create a ``CData`` object, the text inside
2190that object is always presented `exactly as it appears, with no
2191formatting`. Beautiful Soup will call the formatter method, just in
2192case you've written a custom method that counts all the strings in the
2193document or something, but it will ignore the return value::
2194
2195 from bs4.element import CData
2196 soup = BeautifulSoup("<a></a>")
2197 soup.a.string = CData("one < three")
 print(soup.a.prettify(formatter="minimal"))
2199 # <a>
2200 #  <![CDATA[one < three]]>
2201 # </a>
2202
2203
2204``get_text()``
2205--------------
2206
2207If you only want the text part of a document or tag, you can use the
2208``get_text()`` method. It returns all the text in a document or
2209beneath a tag, as a single Unicode string::
2210
2211  markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
2212  soup = BeautifulSoup(markup)
2213
  soup.get_text()
  # u'\nI linked to example.com\n'
  soup.i.get_text()
  # u'example.com'
2218
2219You can specify a string to be used to join the bits of text
2220together::
2221
 soup.get_text("|")
 # u'\nI linked to |example.com|\n'
2224
2225You can tell Beautiful Soup to strip whitespace from the beginning and
2226end of each bit of text::
2227
 soup.get_text("|", strip=True)
 # u'I linked to|example.com'
2230
2231But at that point you might want to use the :ref:`.stripped_strings <string-generators>`
2232generator instead, and process the text yourself::
2233
2234 [text for text in soup.stripped_strings]
2235 # [u'I linked to', u'example.com']
2236
2237Specifying the parser to use
2238============================
2239
2240If you just need to parse some HTML, you can dump the markup into the
2241``BeautifulSoup`` constructor, and it'll probably be fine. Beautiful
2242Soup will pick a parser for you and parse the data. But there are a
2243few additional arguments you can pass in to the constructor to change
2244which parser is used.
2245
2246The first argument to the ``BeautifulSoup`` constructor is a string or
2247an open filehandle--the markup you want parsed. The second argument is
2248`how` you'd like the markup parsed.
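Any object with a ``read()`` method counts as a filehandle. In this sketch, ``io.StringIO`` stands in for a real file opened with ``open()``:

```python
import io
from bs4 import BeautifulSoup

# Equivalent to: soup = BeautifulSoup(open("somefile.html"), "html.parser")
fake_file = io.StringIO("<b>bold</b>")
soup = BeautifulSoup(fake_file, "html.parser")
soup.b.string
# u'bold'
```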
2249
2250If you don't specify anything, you'll get the best HTML parser that's
2251installed. Beautiful Soup ranks lxml's parser as being the best, then
2252html5lib's, then Python's built-in parser. You can override this by
2253specifying one of the following:
2254
2255* What type of markup you want to parse. Currently supported are
2256  "html", "xml", and "html5".
2257
2258* The name of the parser library you want to use. Currently supported
2259  options are "lxml", "html5lib", and "html.parser" (Python's
2260  built-in HTML parser).
2261
2262The section `Installing a parser`_ contrasts the supported parsers.
2263
2264If you don't have an appropriate parser installed, Beautiful Soup will
2265ignore your request and pick a different parser. Right now, the only
2266supported XML parser is lxml. If you don't have lxml installed, asking
2267for an XML parser won't give you one, and asking for "lxml" won't work
2268either.
2269
2270Differences between parsers
2271---------------------------
2272
2273Beautiful Soup presents the same interface to a number of different
2274parsers, but each parser is different. Different parsers will create
2275different parse trees from the same document. The biggest differences
2276are between the HTML parsers and the XML parsers. Here's a short
2277document, parsed as HTML::
2278
2279 BeautifulSoup("<a><b /></a>")
2280 # <html><head></head><body><a><b></b></a></body></html>
2281
2282Since an empty <b /> tag is not valid HTML, the parser turns it into a
2283<b></b> tag pair.
2284
2285Here's the same document parsed as XML (running this requires that you
2286have lxml installed). Note that the empty <b /> tag is left alone, and
2287that the document is given an XML declaration instead of being put
into an <html> tag::
2289
2290 BeautifulSoup("<a><b /></a>", "xml")
2291 # <?xml version="1.0" encoding="utf-8"?>
2292 # <a><b/></a>
2293
2294There are also differences between HTML parsers. If you give Beautiful
2295Soup a perfectly-formed HTML document, these differences won't
2296matter. One parser will be faster than another, but they'll all give
2297you a data structure that looks exactly like the original HTML
2298document.
2299
2300But if the document is not perfectly-formed, different parsers will
2301give different results. Here's a short, invalid document parsed using
2302lxml's HTML parser. Note that the dangling </p> tag is simply
2303ignored::
2304
2305 BeautifulSoup("<a></p>", "lxml")
2306 # <html><body><a></a></body></html>
2307
2308Here's the same document parsed using html5lib::
2309
2310 BeautifulSoup("<a></p>", "html5lib")
2311 # <html><head></head><body><a><p></p></a></body></html>
2312
2313Instead of ignoring the dangling </p> tag, html5lib pairs it with an
2314opening <p> tag. This parser also adds an empty <head> tag to the
2315document.
2316
2317Here's the same document parsed with Python's built-in HTML
2318parser::
2319
2320 BeautifulSoup("<a></p>", "html.parser")
2321 # <a></a>
2322
Like lxml, this parser ignores the dangling </p> tag. Unlike
2324html5lib, this parser makes no attempt to create a well-formed HTML
2325document by adding a <body> tag. Unlike lxml, it doesn't even bother
2326to add an <html> tag.
2327
2328Since the document "<a></p>" is invalid, none of these techniques is
2329the "correct" way to handle it. The html5lib parser uses techniques
2330that are part of the HTML5 standard, so it has the best claim on being
2331the "correct" way, but all three techniques are legitimate.
2332
2333Differences between parsers can affect your script. If you're planning
2334on distributing your script to other people, or running it on multiple
2335machines, you should specify a parser in the ``BeautifulSoup``
2336constructor. That will reduce the chances that your users parse a
2337document differently from the way you parse it.
2338
2339Encodings
2340=========
2341
2342Any HTML or XML document is written in a specific encoding like ASCII
2343or UTF-8.  But when you load that document into Beautiful Soup, you'll
2344discover it's been converted to Unicode::
2345
2346 markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
2347 soup = BeautifulSoup(markup)
2348 soup.h1
2349 # <h1>Sacré bleu!</h1>
2350 soup.h1.string
2351 # u'Sacr\xe9 bleu!'
2352
2353It's not magic. (That sure would be nice.) Beautiful Soup uses a
2354sub-library called `Unicode, Dammit`_ to detect a document's encoding
2355and convert it to Unicode. The autodetected encoding is available as
2356the ``.original_encoding`` attribute of the ``BeautifulSoup`` object::
2357
 soup.original_encoding
 # 'utf-8'
2360
2361Unicode, Dammit guesses correctly most of the time, but sometimes it
2362makes mistakes. Sometimes it guesses correctly, but only after a
2363byte-by-byte search of the document that takes a very long time. If
2364you happen to know a document's encoding ahead of time, you can avoid
2365mistakes and delays by passing it to the ``BeautifulSoup`` constructor
2366as ``from_encoding``.
2367
2368Here's a document written in ISO-8859-8. The document is so short that
2369Unicode, Dammit can't get a good lock on it, and misidentifies it as
2370ISO-8859-7::
2371
2372 markup = b"<h1>\xed\xe5\xec\xf9</h1>"
2373 soup = BeautifulSoup(markup)
 soup.h1
 # <h1>νεμω</h1>
 soup.original_encoding
 # 'ISO-8859-7'
2378
2379We can fix this by passing in the correct ``from_encoding``::
2380
2381 soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
 soup.h1
 # <h1>םולש</h1>
 soup.original_encoding
 # 'iso8859-8'
2386
2387In rare cases (usually when a UTF-8 document contains text written in
2388a completely different encoding), the only way to get Unicode may be
2389to replace some characters with the special Unicode character
2390"REPLACEMENT CHARACTER" (U+FFFD, �). If Unicode, Dammit needs to do
2391this, it will set the ``.contains_replacement_characters`` attribute
2392to ``True`` on the ``UnicodeDammit`` or ``BeautifulSoup`` object. This
2393lets you know that the Unicode representation is not an exact
2394representation of the original--some data was lost. If a document
2395contains �, but ``.contains_replacement_characters`` is ``False``,
2396you'll know that the � was there originally (as it is in this
2397paragraph) and doesn't stand in for missing data.
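Here's a sketch of checking that flag; this input decodes cleanly, so nothing is lost and the flag stays ``False``:

```python
from bs4 import UnicodeDammit

# Valid UTF-8 bytes: no replacement characters are needed.
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
dammit.contains_replacement_characters
# False
```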
2398
2399Output encoding
2400---------------
2401
2402When you write out a document from Beautiful Soup, you get a UTF-8
2403document, even if the document wasn't in UTF-8 to begin with. Here's a
2404document written in the Latin-1 encoding::
2405
2406 markup = b'''
2407  <html>
2408   <head>
2409    <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
2410   </head>
2411   <body>
2412    <p>Sacr\xe9 bleu!</p>
2413   </body>
2414  </html>
2415 '''
2416
2417 soup = BeautifulSoup(markup)
2418 print(soup.prettify())
2419 # <html>
2420 #  <head>
2421 #   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
2422 #  </head>
2423 #  <body>
2424 #   <p>
2425 #    Sacré bleu!
2426 #   </p>
2427 #  </body>
2428 # </html>
2429
2430Note that the <meta> tag has been rewritten to reflect the fact that
2431the document is now in UTF-8.
2432
2433If you don't want UTF-8, you can pass an encoding into ``prettify()``::
2434
2435 print(soup.prettify("latin-1"))
2436 # <html>
2437 #  <head>
2438 #   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
2439 # ...
2440
2441You can also call encode() on the ``BeautifulSoup`` object, or any
2442element in the soup, just as if it were a Python string::
2443
2444 soup.p.encode("latin-1")
2445 # '<p>Sacr\xe9 bleu!</p>'
2446
2447 soup.p.encode("utf-8")
2448 # '<p>Sacr\xc3\xa9 bleu!</p>'
2449
2450Any characters that can't be represented in your chosen encoding will
2451be converted into numeric XML entity references. Here's a document
2452that includes the Unicode character SNOWMAN::
2453
2454 markup = u"<b>\N{SNOWMAN}</b>"
2455 snowman_soup = BeautifulSoup(markup)
2456 tag = snowman_soup.b
2457
2458The SNOWMAN character can be part of a UTF-8 document (it looks like
2459☃), but there's no representation for that character in ISO-Latin-1 or
ASCII, so it's converted into "&#9731;" for those encodings::
2461
2462 print(tag.encode("utf-8"))
2463 # <b>☃</b>
2464
 print(tag.encode("latin-1"))
2466 # <b>&#9731;</b>
2467
 print(tag.encode("ascii"))
2469 # <b>&#9731;</b>
2470
2471Unicode, Dammit
2472---------------
2473
2474You can use Unicode, Dammit without using Beautiful Soup. It's useful
2475whenever you have data in an unknown encoding and you just want it to
2476become Unicode::
2477
2478 from bs4 import UnicodeDammit
2479 dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
2480 print(dammit.unicode_markup)
2481 # Sacré bleu!
2482 dammit.original_encoding
2483 # 'utf-8'
2484
2485Unicode, Dammit's guesses will get a lot more accurate if you install
2486the ``chardet`` or ``cchardet`` Python libraries. The more data you
2487give Unicode, Dammit, the more accurately it will guess. If you have
2488your own suspicions as to what the encoding might be, you can pass
2489them in as a list::
2490
2491 dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
2492 print(dammit.unicode_markup)
2493 # Sacré bleu!
2494 dammit.original_encoding
2495 # 'latin-1'
2496
2497Unicode, Dammit has two special features that Beautiful Soup doesn't
2498use.
2499
2500Smart quotes
2501^^^^^^^^^^^^
2502
2503You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML
2504entities::
2505
2506 markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"
2507
2508 UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
2509 # u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'
2510
2511 UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
2512 # u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'
2513
2514You can also convert Microsoft smart quotes to ASCII quotes::
2515
2516 UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
2517 # u'<p>I just "love" Microsoft Word\'s smart quotes</p>'
2518
2519Hopefully you'll find this feature useful, but Beautiful Soup doesn't
2520use it. Beautiful Soup prefers the default behavior, which is to
2521convert Microsoft smart quotes to Unicode characters along with
2522everything else::
2523
2524 UnicodeDammit(markup, ["windows-1252"]).unicode_markup
2525 # u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'
2526
2527Inconsistent encodings
2528^^^^^^^^^^^^^^^^^^^^^^
2529
2530Sometimes a document is mostly in UTF-8, but contains Windows-1252
2531characters such as (again) Microsoft smart quotes. This can happen
2532when a website includes data from multiple sources. You can use
2533``UnicodeDammit.detwingle()`` to turn such a document into pure
2534UTF-8. Here's a simple example::
2535
2536 snowmen = (u"\N{SNOWMAN}" * 3)
2537 quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
2538 doc = snowmen.encode("utf8") + quote.encode("windows_1252")
2539
2540This document is a mess. The snowmen are in UTF-8 and the quotes are
2541in Windows-1252. You can display the snowmen or the quotes, but not
2542both::
2543
2544 print(doc)
2545 # ☃☃☃�I like snowmen!�
2546
2547 print(doc.decode("windows-1252"))
2548 # ☃☃☃“I like snowmen!”
2549
2550Decoding the document as UTF-8 raises a ``UnicodeDecodeError``, and
2551decoding it as Windows-1252 gives you gibberish. Fortunately,
2552``UnicodeDammit.detwingle()`` will convert the string to pure UTF-8,
2553allowing you to decode it to Unicode and display the snowmen and quote
2554marks simultaneously::
2555
2556 new_doc = UnicodeDammit.detwingle(doc)
2557 print(new_doc.decode("utf8"))
2558 # ☃☃☃“I like snowmen!”
2559
2560``UnicodeDammit.detwingle()`` only knows how to handle Windows-1252
2561embedded in UTF-8 (or vice versa, I suppose), but this is the most
2562common case.
2563
2564Note that you must know to call ``UnicodeDammit.detwingle()`` on your
2565data before passing it into ``BeautifulSoup`` or the ``UnicodeDammit``
2566constructor. Beautiful Soup assumes that a document has a single
2567encoding, whatever it might be. If you pass it a document that
2568contains both UTF-8 and Windows-1252, it's likely to think the whole
document is Windows-1252, and the document will come out looking like
`â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”`.
2571
2572``UnicodeDammit.detwingle()`` is new in Beautiful Soup 4.1.0.
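In other words, run ``detwingle()`` first and parse second. Here's a sketch of the safe pipeline (``html.parser`` is passed just to pin down a parser):

```python
from bs4 import BeautifulSoup, UnicodeDammit

snowmen = (u"\N{SNOWMAN}" * 3).encode("utf8")
quote = u"\u201cI like snowmen!\u201d".encode("windows_1252")
doc = snowmen + quote

# Detwingle the mixed-encoding bytestring into pure UTF-8, then parse.
soup = BeautifulSoup(UnicodeDammit.detwingle(doc), "html.parser")
soup.get_text()
# u'☃☃☃“I like snowmen!”'
```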
2573
2574Parsing only part of a document
2575===============================
2576
Let's say you want to use Beautiful Soup to look at a document's <a>
2578tags. It's a waste of time and memory to parse the entire document and
2579then go over it again looking for <a> tags. It would be much faster to
2580ignore everything that wasn't an <a> tag in the first place. The
2581``SoupStrainer`` class allows you to choose which parts of an incoming
2582document are parsed. You just create a ``SoupStrainer`` and pass it in
2583to the ``BeautifulSoup`` constructor as the ``parse_only`` argument.
2584
2585(Note that *this feature won't work if you're using the html5lib parser*.
2586If you use html5lib, the whole document will be parsed, no
2587matter what. This is because html5lib constantly rearranges the parse
2588tree as it works, and if some part of the document didn't actually
2589make it into the parse tree, it'll crash. To avoid confusion, in the
2590examples below I'll be forcing Beautiful Soup to use Python's
2591built-in parser.)
2592
2593``SoupStrainer``
2594----------------
2595
The ``SoupStrainer`` class takes the same arguments as a typical
method from `Searching the tree`_: :ref:`name <name>`, :ref:`attrs
<attrs>`, :ref:`text <text>`, and :ref:`**kwargs <kwargs>`. Here are
three ``SoupStrainer`` objects::

 from bs4 import SoupStrainer

 only_a_tags = SoupStrainer("a")

 only_tags_with_id_link2 = SoupStrainer(id="link2")

 def is_short_string(string):
     return len(string) < 10

 only_short_strings = SoupStrainer(text=is_short_string)

I'm going to bring back the "three sisters" document one more time,
and we'll see what the document looks like when it's parsed with these
three ``SoupStrainer`` objects::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """

 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
 # <a class="sister" href="http://example.com/elsie" id="link1">
 #  Elsie
 # </a>
 # <a class="sister" href="http://example.com/lacie" id="link2">
 #  Lacie
 # </a>
 # <a class="sister" href="http://example.com/tillie" id="link3">
 #  Tillie
 # </a>

 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
 # <a class="sister" href="http://example.com/lacie" id="link2">
 #  Lacie
 # </a>

 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
 # Elsie
 # ,
 # Lacie
 # and
 # Tillie
 # ...
 #

You can also pass a ``SoupStrainer`` into any of the methods covered
in `Searching the tree`_. This probably isn't terribly useful, but I
thought I'd mention it::

 soup = BeautifulSoup(html_doc, "html.parser")
 soup.find_all(only_short_strings)
 # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
 #  u'\n\n', u'...', u'\n']

Troubleshooting
===============

.. _diagnose:

``diagnose()``
--------------

If you're having trouble understanding what Beautiful Soup does to a
document, pass the document into the ``diagnose()`` function. (New in
Beautiful Soup 4.2.0.) Beautiful Soup will print out a report showing
you how different parsers handle the document, and tell you if you're
missing a parser that Beautiful Soup could be using::

 from bs4.diagnose import diagnose
 data = open("bad.html").read()
 diagnose(data)

 # Diagnostic running on Beautiful Soup 4.2.0
 # Python version 2.7.3 (default, Aug  1 2012, 05:16:07)
 # I noticed that html5lib is not installed. Installing it may help.
 # Found lxml version 2.3.2.0
 #
 # Trying to parse your data with html.parser
 # Here's what html.parser did with the document:
 # ...
Just looking at the output of ``diagnose()`` may show you how to solve
the problem. Even if it doesn't, you can paste the output of
``diagnose()`` when asking for help.

Errors when parsing a document
------------------------------

There are two different kinds of parse errors. There are crashes,
where you feed a document to Beautiful Soup and it raises an
exception, usually an ``HTMLParser.HTMLParseError``. And there is
unexpected behavior, where a Beautiful Soup parse tree looks a lot
different from the document used to create it.

Almost none of these problems turn out to be problems with Beautiful
Soup. This is not because Beautiful Soup is an amazingly well-written
piece of software. It's because Beautiful Soup doesn't include any
parsing code. Instead, it relies on external parsers. If one parser
isn't working on a certain document, the best solution is to try a
different parser. See `Installing a parser`_ for details and a parser
comparison.
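
One defensive pattern (a sketch of that advice, not something built
into Beautiful Soup) is to try the better parsers first and fall back
to the built-in one. ``FeatureNotFound`` is the exception Beautiful
Soup raises when a requested parser isn't installed; the markup here
is invented for illustration::

 from bs4 import BeautifulSoup, FeatureNotFound

 markup = "<p>Some <b>possibly broken<p> markup"

 # Try lxml, then html5lib, then the always-available built-in parser.
 for parser in ("lxml", "html5lib", "html.parser"):
     try:
         soup = BeautifulSoup(markup, parser)
         break
     except FeatureNotFound:
         continue  # that parser isn't installed; try the next one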

The most common parse errors are ``HTMLParser.HTMLParseError:
malformed start tag`` and ``HTMLParser.HTMLParseError: bad end
tag``. These are both generated by Python's built-in HTML parser
library, and the solution is to :ref:`install lxml or
html5lib. <parser-installation>`

The most common type of unexpected behavior is that you can't find a
tag that you know is in the document. You saw it going in, but
``find_all()`` returns ``[]`` or ``find()`` returns ``None``. This is
another common problem with Python's built-in HTML parser, which
sometimes skips tags it doesn't understand. Again, the solution is to
:ref:`install lxml or html5lib. <parser-installation>`

Version mismatch problems
-------------------------

* ``SyntaxError: Invalid syntax`` (on the line ``ROOT_TAG_NAME =
  u'[document]'``) - Caused by running the Python 2 version of
  Beautiful Soup under Python 3, without converting the code.

* ``ImportError: No module named HTMLParser`` - Caused by running the
  Python 2 version of Beautiful Soup under Python 3.

* ``ImportError: No module named html.parser`` - Caused by running the
  Python 3 version of Beautiful Soup under Python 2.

* ``ImportError: No module named BeautifulSoup`` - Caused by running
  Beautiful Soup 3 code on a system that doesn't have BS3 installed,
  or by writing Beautiful Soup 4 code without knowing that the
  package name has changed to ``bs4``.

* ``ImportError: No module named bs4`` - Caused by running Beautiful
  Soup 4 code on a system that doesn't have BS4 installed.

.. _parsing-xml:

Parsing XML
-----------

By default, Beautiful Soup parses documents as HTML. To parse a
document as XML, pass in "xml" as the second argument to the
``BeautifulSoup`` constructor::

 soup = BeautifulSoup(markup, "xml")

You'll need to :ref:`have lxml installed <parser-installation>`.

Other parser problems
---------------------

* If your script works on one computer but not another, it's probably
  because the two computers have different parser libraries
  available. For example, you may have developed the script on a
  computer that has lxml installed, and then tried to run it on a
  computer that only has html5lib installed. See `Differences between
  parsers`_ for why this matters, and fix the problem by mentioning a
  specific parser library in the ``BeautifulSoup`` constructor.

* Because `HTML tags and attributes are case-insensitive
  <http://www.w3.org/TR/html5/syntax.html#syntax>`_, all three HTML
  parsers convert tag and attribute names to lowercase. That is, the
  markup ``<TAG></TAG>`` is converted to ``<tag></tag>``. If you want
  to preserve mixed-case or uppercase tags and attributes, you'll need
  to :ref:`parse the document as XML. <parsing-xml>`

.. _misc:

Miscellaneous
-------------

* ``UnicodeEncodeError: 'charmap' codec can't encode character
  u'\xfoo' in position bar`` (or just about any other
  ``UnicodeEncodeError``) - This is not a problem with Beautiful Soup.
  This problem shows up in two main situations. First, when you try to
  print a Unicode character that your console doesn't know how to
  display. (See `this page on the Python wiki
  <http://wiki.python.org/moin/PrintFails>`_ for help.) Second, when
  you're writing to a file and you pass in a Unicode character that's
  not supported by your default encoding. In this case, the simplest
  solution is to explicitly encode the Unicode string into UTF-8 with
  ``u.encode("utf8")``.

* ``KeyError: [attr]`` - Caused by accessing ``tag['attr']`` when the
  tag in question doesn't define the ``attr`` attribute. The most
  common errors are ``KeyError: 'href'`` and ``KeyError:
  'class'``. Use ``tag.get('attr')`` if you're not sure ``attr`` is
  defined, just as you would with a Python dictionary.

* ``AttributeError: 'ResultSet' object has no attribute 'foo'`` - This
  usually happens because you expected ``find_all()`` to return a
  single tag or string. But ``find_all()`` returns a *list* of tags
  and strings: a ``ResultSet`` object. You need to iterate over the
  list and look at the ``.foo`` of each one. Or, if you really only
  want one result, you need to use ``find()`` instead of
  ``find_all()``.

* ``AttributeError: 'NoneType' object has no attribute 'foo'`` - This
  usually happens because you called ``find()`` and then tried to
  access the ``.foo`` attribute of the result. But in your case,
  ``find()`` didn't find anything, so it returned ``None`` instead of
  a tag or a string. You need to figure out why your ``find()`` call
  isn't returning anything.
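
Here's a small sketch of those defensive habits, using invented
markup: ``tag.get()`` for attributes that might be missing, and an
explicit ``None`` check on the result of ``find()``::

 from bs4 import BeautifulSoup

 soup = BeautifulSoup('<a href="http://example.com/">Link</a>', "html.parser")
 tag = soup.a

 # tag['id'] would raise KeyError: 'id'; .get() returns None instead.
 print(tag.get("id"))
 # None
 print(tag["href"])
 # http://example.com/

 # find() returns None when nothing matches; check before using the result.
 table = soup.find("table")
 if table is not None:
     print(table.name)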

Improving Performance
---------------------

Beautiful Soup will never be as fast as the parsers it sits on top
of. If response time is critical, if you're paying for computer time
by the hour, or if there's any other reason why computer time is more
valuable than programmer time, you should forget about Beautiful Soup
and work directly atop `lxml <http://lxml.de/>`_.

That said, there are things you can do to speed up Beautiful Soup. If
you're not using lxml as the underlying parser, my advice is to
:ref:`start <parser-installation>`. Beautiful Soup parses documents
significantly faster using lxml than using html.parser or html5lib.
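
If you want to see the difference on your own hardware, a rough timing
sketch like this will do; the markup and document size are arbitrary,
and you can swap ``"lxml"`` in for ``"html.parser"`` once it's
installed::

 import time
 from bs4 import BeautifulSoup

 markup = "<html><body>" + "<p class='x'>text</p>" * 1000 + "</body></html>"

 def time_parse(parser):
     # Parse the same document and report the elapsed wall-clock time.
     start = time.time()
     BeautifulSoup(markup, parser)
     return time.time() - start

 # "html.parser" always works; try "lxml" or "html5lib" if installed.
 print("html.parser: %.4fs" % time_parse("html.parser"))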

You can speed up encoding detection significantly by installing the
`cchardet <http://pypi.python.org/pypi/cchardet/>`_ library.

`Parsing only part of a document`_ won't save you much time parsing
the document, but it can save a lot of memory, and it'll make
searching the document much faster.

Beautiful Soup 3
================

Beautiful Soup 3 is the previous release series, and is no longer
being actively developed. It's currently packaged with all major Linux
distributions:

:kbd:`$ apt-get install python-beautifulsoup`

It's also published through PyPI as ``BeautifulSoup``:

:kbd:`$ easy_install BeautifulSoup`

:kbd:`$ pip install BeautifulSoup`

You can also `download a tarball of Beautiful Soup 3.2.0
<http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz>`_.

If you ran ``easy_install beautifulsoup`` or ``easy_install
BeautifulSoup``, but your code doesn't work, you installed Beautiful
Soup 3 by mistake. You need to run ``easy_install beautifulsoup4``.

`The documentation for Beautiful Soup 3 is archived online
<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_. If
your first language is Chinese, it might be easier for you to read
`the Chinese translation of the Beautiful Soup 3 documentation
<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html>`_,
then read this document to find out about the changes made in
Beautiful Soup 4.

Porting code to BS4
-------------------

Most code written against Beautiful Soup 3 will work against Beautiful
Soup 4 with one simple change. All you should have to do is change the
package name from ``BeautifulSoup`` to ``bs4``. So this::

  from BeautifulSoup import BeautifulSoup

becomes this::

  from bs4 import BeautifulSoup

* If you get the ``ImportError`` "No module named BeautifulSoup", your
  problem is that you're trying to run Beautiful Soup 3 code, but you
  only have Beautiful Soup 4 installed.

* If you get the ``ImportError`` "No module named bs4", your problem
  is that you're trying to run Beautiful Soup 4 code, but you only
  have Beautiful Soup 3 installed.

Although BS4 is mostly backwards-compatible with BS3, most of its
methods have been deprecated and given new names for `PEP 8 compliance
<http://www.python.org/dev/peps/pep-0008/>`_. There are numerous other
renames and changes, and a few of them break backwards compatibility.

Here's what you'll need to know to convert your BS3 code and habits to BS4:

You need a parser
^^^^^^^^^^^^^^^^^

Beautiful Soup 3 used Python's ``SGMLParser``, a module that was
deprecated and removed in Python 3.0. Beautiful Soup 4 uses
``html.parser`` by default, but you can plug in lxml or html5lib and
use that instead. See `Installing a parser`_ for a comparison.

Since ``html.parser`` is not the same parser as ``SGMLParser``, it
will treat invalid markup differently. Usually the "difference" is
that ``html.parser`` crashes. In that case, you'll need to install
another parser. But sometimes ``html.parser`` just creates a different
parse tree than ``SGMLParser`` would. If this happens, you may need to
update your BS3 scraping code to deal with the new tree.

Method names
^^^^^^^^^^^^

* ``renderContents`` -> ``encode_contents``
* ``replaceWith`` -> ``replace_with``
* ``replaceWithChildren`` -> ``unwrap``
* ``findAll`` -> ``find_all``
* ``findAllNext`` -> ``find_all_next``
* ``findAllPrevious`` -> ``find_all_previous``
* ``findNext`` -> ``find_next``
* ``findNextSibling`` -> ``find_next_sibling``
* ``findNextSiblings`` -> ``find_next_siblings``
* ``findParent`` -> ``find_parent``
* ``findParents`` -> ``find_parents``
* ``findPrevious`` -> ``find_previous``
* ``findPreviousSibling`` -> ``find_previous_sibling``
* ``findPreviousSiblings`` -> ``find_previous_siblings``
* ``nextSibling`` -> ``next_sibling``
* ``previousSibling`` -> ``previous_sibling``

Some arguments to the Beautiful Soup constructor were renamed for the
same reasons:

* ``BeautifulSoup(parseOnlyThese=...)`` -> ``BeautifulSoup(parse_only=...)``
* ``BeautifulSoup(fromEncoding=...)`` -> ``BeautifulSoup(from_encoding=...)``

I renamed one method for compatibility with Python 3:

* ``Tag.has_key()`` -> ``Tag.has_attr()``

I renamed one attribute to use more accurate terminology:

* ``Tag.isSelfClosing`` -> ``Tag.is_empty_element``

I renamed three attributes to avoid using words that have special
meaning to Python. Unlike the others, these changes are *not backwards
compatible.* If you used these attributes in BS3, your code will break
on BS4 until you change them.

* ``UnicodeDammit.unicode`` -> ``UnicodeDammit.unicode_markup``
* ``Tag.next`` -> ``Tag.next_element``
* ``Tag.previous`` -> ``Tag.previous_element``

Generators
^^^^^^^^^^

I gave the generators PEP 8-compliant names, and transformed them into
properties:

* ``childGenerator()`` -> ``children``
* ``nextGenerator()`` -> ``next_elements``
* ``nextSiblingGenerator()`` -> ``next_siblings``
* ``previousGenerator()`` -> ``previous_elements``
* ``previousSiblingGenerator()`` -> ``previous_siblings``
* ``recursiveChildGenerator()`` -> ``descendants``
* ``parentGenerator()`` -> ``parents``

So instead of this::

 for parent in tag.parentGenerator():
     ...

You can write this::

 for parent in tag.parents:
     ...

(But the old code will still work.)

Some of the generators used to yield ``None`` after they were done, and
then stop. That was a bug. Now the generators just stop.

There are two new generators, :ref:`.strings and
.stripped_strings <string-generators>`. ``.strings`` yields
``NavigableString`` objects, and ``.stripped_strings`` yields Python
strings that have had whitespace stripped.
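
A quick example of the difference, with made-up markup::

 from bs4 import BeautifulSoup

 soup = BeautifulSoup("<p> One </p><p> Two </p>", "html.parser")
 # .strings keeps the surrounding whitespace: ' One ', ' Two '
 print(list(soup.stripped_strings))
 # ['One', 'Two']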

XML
^^^

There is no longer a ``BeautifulStoneSoup`` class for parsing XML. To
parse XML you pass in "xml" as the second argument to the
``BeautifulSoup`` constructor. For the same reason, the
``BeautifulSoup`` constructor no longer recognizes the ``isHTML``
argument.

Beautiful Soup's handling of empty-element XML tags has been
improved. Previously when you parsed XML you had to explicitly say
which tags were considered empty-element tags. The ``selfClosingTags``
argument to the constructor is no longer recognized. Instead,
Beautiful Soup considers any empty tag to be an empty-element tag. If
you add a child to an empty-element tag, it stops being an
empty-element tag.

Entities
^^^^^^^^

An incoming HTML or XML entity is always converted into the
corresponding Unicode character. Beautiful Soup 3 had a number of
overlapping ways of dealing with entities, which have been
removed. The ``BeautifulSoup`` constructor no longer recognizes the
``smartQuotesTo`` or ``convertEntities`` arguments. (`Unicode,
Dammit`_ still has ``smart_quotes_to``, but its default is now to turn
smart quotes into Unicode.) The constants ``HTML_ENTITIES``,
``XML_ENTITIES``, and ``XHTML_ENTITIES`` have been removed, since they
configure a feature (transforming some but not all entities into
Unicode characters) that no longer exists.

If you want to turn Unicode characters back into HTML entities on
output, rather than turning them into UTF-8 characters, you need to
use an :ref:`output formatter <output_formatters>`.
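
For instance, the "html" formatter converts non-ASCII characters into
named entities on output (a minimal sketch; the markup is invented)::

 from bs4 import BeautifulSoup

 soup = BeautifulSoup("<p>El Ni\xf1o</p>", "html.parser")
 # The "html" formatter turns the n-with-tilde back into an entity.
 print(soup.p.decode(formatter="html"))
 # <p>El Ni&ntilde;o</p>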

Miscellaneous
^^^^^^^^^^^^^

:ref:`Tag.string <.string>` now operates recursively. If tag A
contains a single tag B and nothing else, then ``A.string`` is the
same as ``B.string``. (Previously, it was ``None``.)
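
A small example of the new behavior::

 from bs4 import BeautifulSoup

 soup = BeautifulSoup("<p><b>Only child</b></p>", "html.parser")
 # BS4 recurses through the lone <b> child; BS3 returned None here.
 print(soup.p.string)
 # Only child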

`Multi-valued attributes`_ like ``class`` have lists of strings as
their values, not strings. This may affect the way you search by CSS
class.
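
Here's what that looks like::

 from bs4 import BeautifulSoup

 soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
 # In BS3 this was the single string "body strikeout".
 print(soup.p["class"])
 # ['body', 'strikeout']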

If you pass one of the ``find*`` methods both :ref:`text <text>` *and*
a tag-specific argument like :ref:`name <name>`, Beautiful Soup will
search for tags that match your tag-specific criteria and whose
:ref:`Tag.string <.string>` matches your value for :ref:`text
<text>`. It will *not* find the strings themselves. Previously,
Beautiful Soup ignored the tag-specific arguments and looked for
strings.

The ``BeautifulSoup`` constructor no longer recognizes the
``markupMassage`` argument. It's now the parser's responsibility to
handle markup correctly.

The rarely-used alternate parser classes like
``ICantBelieveItsBeautifulSoup`` and ``BeautifulSOAP`` have been
removed. It's now the parser's decision how to handle ambiguous
markup.

The ``prettify()`` method now returns a Unicode string, not a bytestring.
