Beautiful Soup Documentation
============================
.. image:: 6.1.jpg
   :align: right
   :alt: "The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."
Going back and forth
--------------------

Take a look at the beginning of the "three sisters" document::

 <html><head><title>The Dormouse's story</title></head>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

An HTML parser takes this string of characters and turns it into a
series of events: "open an <html> tag", "open a <head> tag", "open a
<title> tag", "add a string", "close the <title> tag", "open a <p>
tag", and so on. Beautiful Soup offers tools for reconstructing the
initial parse of the document.
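
To make "series of events" concrete, here's a minimal sketch (my own
illustration, not part of Beautiful Soup; it assumes Python 3's
built-in ``html.parser`` module) that prints the event stream for the
first line of that document::

 from html.parser import HTMLParser

 class EventLogger(HTMLParser):
     def handle_starttag(self, tag, attrs):
         print('open a <%s> tag' % tag)
     def handle_data(self, data):
         print('add a string: %r' % data)
     def handle_endtag(self, tag):
         print('close the </%s> tag' % tag)

 EventLogger().feed("<html><head><title>The Dormouse's story</title></head>")
 # open a <html> tag
 # open a <head> tag
 # open a <title> tag
 # add a string: "The Dormouse's story"
 # close the </title> tag
 # close the </head> tag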
.. _element-generators:
``.next_element`` and ``.previous_element``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``.next_element`` attribute of a string or tag points to whatever
was parsed immediately afterwards. It might be the same as
``.next_sibling``, but it's usually drastically different.
Here's the final <a> tag in the "three sisters" document. Its
``.next_sibling`` is a string: the conclusion of the sentence that was
interrupted by the start of the <a> tag::

 last_a_tag = soup.find("a", id="link3")
 last_a_tag
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 last_a_tag.next_sibling
 # '; and they lived at the bottom of a well.'
But the ``.next_element`` of that <a> tag, the thing that was parsed
immediately after the <a> tag, is `not` the rest of that sentence:
it's the word "Tillie"::

 last_a_tag.next_element
 # u'Tillie'
That's because in the original markup, the word "Tillie" appeared
before that semicolon. The parser encountered an <a> tag, then the
word "Tillie", then the closing </a> tag, then the semicolon and rest
of the sentence. The semicolon is on the same level as the <a> tag,
but the word "Tillie" was encountered first.
The ``.previous_element`` attribute is the exact opposite of
``.next_element``. It points to whatever element was parsed
immediately before this one::
 last_a_tag.previous_element
 # u' and\n'
 last_a_tag.previous_element.next_element
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
``.next_elements`` and ``.previous_elements``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You should get the idea by now. You can use these iterators to move
forward or backward in the document as it was parsed::
 for element in last_a_tag.next_elements:
     print(repr(element))
 # u'Tillie'
 # u';\nand they lived at the bottom of a well.'
 # u'\n\n'
 # <p class="story">...</p>
 # u'...'
 # u'\n'
 # None
If none of the other matches work for you, define a function that
takes an element as its only argument. The function should return
``True`` if the argument matches, and ``False`` otherwise. Here's a
function that returns ``True`` if a tag defines the "class" attribute
but doesn't define the "id" attribute::

 def has_class_but_no_id(tag):
     return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into ``find_all()`` and you'll pick up all the <p>
tags::

 soup.find_all(has_class_but_no_id)
 # [<p class="title"><b>The Dormouse's story</b></p>,
 #  <p class="story">Once upon a time there were...</p>,
 #  <p class="story">...</p>]

This function only picks up the <p> tags. It doesn't pick up the <a>
tags, because those tags define both "class" and "id". It doesn't pick
up tags like <html> and <title>, because those tags don't define
"class".

``find_all()``
--------------

The ``find_all()`` method looks through a tag's descendants and
retrieves `all` descendants that match your filters. Why does
``find_all("p", "title")`` find a <p> tag with the CSS class "title"?
Let's look at the arguments to ``find_all()``.
.. _name:
The ``name`` argument
^^^^^^^^^^^^^^^^^^^^^
Pass in a value for ``name`` and you'll tell Beautiful Soup to only
consider tags with certain names. Text strings will be ignored, as
will tags whose names don't match.
This is the simplest usage::
soup.find_all("title")
# [ The Dormouse's story Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well. tags is an
indirect parent of the string, and our search finds that as
well. There's a <p> tag with the CSS class "title" `somewhere` in the
document, but it's not one of this string's parents, so we can't find
it with ``find_parents()``.
You may have made the connection between ``find_parent()`` and
``find_parents()``, and the `.parent`_ and `.parents`_ attributes
mentioned earlier. The connection is very strong. These search methods
actually use ``.parents`` to iterate over all the parents, and check
each one against the provided filter to see if it matches.
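
For instance, this loose equivalent (a sketch of the idea, not the
library's actual code) finds the same <p> parent by walking
``.parents`` by hand::

 a_string = soup.find(text="Lacie")
 [parent for parent in a_string.parents if parent.name == "p"][0]
 # <p class="story">Once upon a time there were three little sisters; ...</p>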
``find_next_siblings()`` and ``find_next_sibling()``
----------------------------------------------------
Signature: find_next_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_next_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.next_siblings <sibling-generators>` to
iterate over the rest of an element's siblings in the tree. The
``find_next_siblings()`` method returns all the siblings that match,
and ``find_next_sibling()`` only returns the first one::

 first_link = soup.a
 first_link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 first_link.find_next_siblings("a")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 first_story_paragraph = soup.find("p", "story")
 first_story_paragraph.find_next_sibling("p")
 # <p class="story">...</p>

``find_all_next()`` and ``find_next()``
----------------------------------------

Signature: find_all_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.next_elements <element-generators>` to
iterate over whatever tags and strings come after an element in the
document. The ``find_all_next()`` method returns all matches, and
``find_next()`` only returns the first match::

 first_link = soup.a
 first_link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 first_link.find_all_next(text=True)
 # [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
 #  u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']

 first_link.find_next("p")
 # <p class="story">...</p>

In the first example, all the strings that come after the first <a>
tag showed up. In the second example, the last <p> tag in the
document showed up, even though it's not in the same part of the tree
as the <a> tag we started from. For these
methods, all that matters is that an element match the filter, and
show up later in the document than the starting element.
``find_all_previous()`` and ``find_previous()``
-----------------------------------------------
Signature: find_all_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.previous_elements <element-generators>` to
iterate over the tags and strings that came before an element in the
document. The ``find_all_previous()`` method returns all matches, and
``find_previous()`` only returns the first match::

 first_link = soup.a
 first_link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 first_link.find_all_previous("p")
 # [<p class="story">Once upon a time there were three little sisters; ...</p>,
 #  <p class="title"><b>The Dormouse's story</b></p>]

 first_link.find_previous("title")
 # <title>The Dormouse's story</title>

The call to ``find_all_previous("p")`` found the first paragraph in
the document (the one with class="story"), but it also found the
second paragraph, the <p> tag that contains the <a> tag we started
with. This shouldn't be too surprising: we're looking at all the tags
that show up earlier in the document than the one we started with. A
<p> tag that contains an <a> tag must have shown up before the <a>
tag it contains.
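
In the same way, these methods are just filters over
:ref:`.previous_elements <element-generators>`; a rough hand-rolled
equivalent (an illustration, not what the library actually does) looks
like this::

 first_link = soup.a
 [element for element in first_link.previous_elements
  if getattr(element, "name", None) == "p"]
 # [<p class="story">Once upon a time there were three little sisters; ...</p>,
 #  <p class="title"><b>The Dormouse's story</b></p>]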
CSS selectors
-------------
Beautiful Soup supports the most commonly-used `CSS selectors
<http://www.w3.org/TR/CSS2/selector.html>`_. Just pass a string into
the ``.select()`` method of a ``Tag`` object or the ``BeautifulSoup``
object itself::

 soup.select("title")
 # [<title>The Dormouse's story</title>]

 soup.select("p > a")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

This is a convenience for users who know the CSS selector syntax. You
can do all this stuff with the Beautiful Soup API, but the CSS
selectors are a nice shorthand for common operations.

``Tag``
-------

A ``Tag`` object corresponds to an XML or HTML tag in the original
document::

 soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
 tag = soup.b
 type(tag)
 # <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, most of which are covered
in `Navigating the tree`_ and `Searching the tree`_. For now, the most
important features of a tag are its name and attributes.
Attributes
^^^^^^^^^^
A tag may have any number of attributes. The tag ``<b
class="boldest">`` has an attribute "class" whose value is
"boldest". You can access a tag's attributes by treating the tag like
a dictionary::
 tag['class']
 # u'boldest'

You can access that dictionary directly as ``.attrs``::

 tag.attrs
 # {u'class': u'boldest'}

You can add, remove, and modify a tag's attributes. Again, this is
done by treating the tag as a dictionary::

 tag['class'] = 'verybold'
 tag['id'] = 1
 tag
 # <b class="verybold" id="1">Extremely bold</b>

 del tag['class']
 del tag['id']
 tag
 # <b>Extremely bold</b>

 tag['class']
 # KeyError: 'class'

 print(tag.get('class'))
 # None
.. _multivalue:
Multi-valued attributes
&&&&&&&&&&&&&&&&&&&&&&&
HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is ``class`` (that is, a tag can have more than
one CSS class). Others include ``rel``, ``rev``, ``accept-charset``,
``headers``, and ``accesskey``. Beautiful Soup presents the value(s)
of a multi-valued attribute as a list::
 css_soup = BeautifulSoup('<p class="body strikeout"></p>')
 css_soup.p['class']
 # ["body", "strikeout"]

 css_soup = BeautifulSoup('<p class="body"></p>')
 css_soup.p['class']
 # ["body"]

If an attribute `looks` like it has more than one value, but it's not
a multi-valued attribute as defined by any version of the HTML
standard, Beautiful Soup will leave the attribute alone::

 id_soup = BeautifulSoup('<p id="my id"></p>')
 id_soup.p['id']
 # 'my id'
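
Because of this, code that reads attribute values may see either a
list or a plain string. One defensive pattern (my suggestion, not
something Beautiful Soup requires) is to normalize the value
yourself::

 css_soup = BeautifulSoup('<p class="body strikeout"></p>')
 value = css_soup.p['class']
 if isinstance(value, list):
     value = ' '.join(value)
 value
 # 'body strikeout'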
When you turn a tag back into a string, multiple attribute values are
consolidated::
 rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
 rel_soup.a['rel']
 # ['index']
 rel_soup.a['rel'] = ['index', 'contents']
 print(rel_soup.p)
 # <p>Back to the <a rel="index contents">homepage</a></p>

``NavigableString``
-------------------

A string corresponds to a bit of text within a tag. Beautiful Soup
uses the ``NavigableString`` class to contain these bits of text::

 tag.string
 # u'Extremely bold'
 type(tag.string)
 # <class 'bs4.element.NavigableString'>

You can't edit a string in place, but you can replace one string with
another, using :ref:`replace_with()`::

 tag.string.replace_with("No longer bold")
 tag
 # <b>No longer bold</b>
``NavigableString`` supports most of the features described in
`Navigating the tree`_ and `Searching the tree`_, but not all of
them. In particular, since a string can't contain anything (the way a
tag may contain a string or another tag), strings don't support the
``.contents`` or ``.string`` attributes, or the ``find()`` method.
If you want to use a ``NavigableString`` outside of Beautiful Soup,
you should call ``unicode()`` on it to turn it into a normal Python
Unicode string. If you don't, your string will carry around a
reference to the entire Beautiful Soup parse tree, even when you're
done using Beautiful Soup. This is a big waste of memory.
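
Here's what that conversion looks like, continuing the example above
(this uses Python 2's ``unicode()``, to match the rest of this
document)::

 unicode_string = unicode(tag.string)
 unicode_string
 # u'No longer bold'
 type(unicode_string)
 # <type 'unicode'>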
``BeautifulSoup``
-----------------
The ``BeautifulSoup`` object itself represents the document as a
whole. For most purposes, you can treat it as a :ref:`Tag`
object. This means it supports most of the methods described in
`Navigating the tree`_ and `Searching the tree`_.
Since the ``BeautifulSoup`` object doesn't correspond to an actual
HTML or XML tag, it has no name and no attributes. But sometimes it's
useful to look at its ``.name``, so it's been given the special
``.name`` "[document]"::
 soup.name
 # u'[document]'
Comments and other special strings
----------------------------------
``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
everything you'll see in an HTML or XML file, but there are a few
leftover bits. The only one you'll probably ever need to worry about
is the comment::
markup = ""
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# Extremely bold
del tag['class']
del tag['id']
tag
# Extremely bold
Modifying ``.string``
---------------------
If you set a tag's ``.string`` attribute, the tag's contents are
replaced with the string you give::
 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)

 tag = soup.a
 tag.string = "New link text."
 tag
 # <a href="http://example.com/">New link text.</a>
Be careful: if the tag contained other tags, they and all their
contents will be destroyed.
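
Continuing the example above, you can verify that the <i> tag is
really gone::

 print(soup.find("i"))
 # None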
``append()``
------------
You can add to a tag's contents with ``Tag.append()``. It works just
like calling ``.append()`` on a Python list::
soup = BeautifulSoup("Foo")
soup.a.append("Bar")
soup
# FooBar
soup.a.contents
# [u'Foo', u'Bar']
``BeautifulSoup.new_string()`` and ``.new_tag()``
-------------------------------------------------
If you need to add a string to a document, no problem--you can pass a
Python string in to ``append()``, or you can call the factory method
``BeautifulSoup.new_string()``::
soup = BeautifulSoup("")
tag = soup.b
tag.append("Hello")
new_string = soup.new_string(" there")
tag.append(new_string)
tag
# Hello there.
tag.contents
# [u'Hello', u' there']
If you want to create a comment or some other subclass of
``NavigableString``, pass that class as the second argument to
``new_string()``::
 from bs4 import Comment
 new_comment = soup.new_string("Nice to see you.", Comment)
 tag.append(new_comment)
 tag
 # <b>Hello there<!--Nice to see you.--></b>
 tag.contents
 # [u'Hello', u' there', u'Nice to see you.']
(This is a new feature in Beautiful Soup 4.2.1.)
What if you need to create a whole new tag? The best solution is to
call the factory method ``BeautifulSoup.new_tag()``::
soup = BeautifulSoup("")
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
#
new_tag.string = "Link text."
original_tag
# Link text.
Only the first argument, the tag name, is required.
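
One thing to watch for (my note, not from the original example): since
``class`` is a reserved word in Python, you can't pass it as a keyword
argument to ``new_tag()``. Set it through dictionary access after
creating the tag instead::

 new_tag = soup.new_tag("a", href="http://www.example.com")
 new_tag['class'] = 'external'
 new_tag
 # <a class="external" href="http://www.example.com"></a>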
``insert()``
------------
``Tag.insert()`` is just like ``Tag.append()``, except the new element
doesn't necessarily go at the end of its parent's
``.contents``. It'll be inserted at whatever numeric position you
say. It works just like ``.insert()`` on a Python list::
 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)
 tag = soup.a

 tag.insert(1, "but did not endorse ")
 tag
 # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
 tag.contents
 # [u'I linked to ', u'but did not endorse ', <i>example.com</i>]
``insert_before()`` and ``insert_after()``
------------------------------------------
The ``insert_before()`` method inserts a tag or string immediately
before something else in the parse tree::
soup = BeautifulSoup("stop")
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
soup.b
# Don'tstop
The ``insert_after()`` method inserts a tag or string so that it
immediately follows something else in the parse tree::
 soup.b.i.insert_after(soup.new_string(" ever "))
 soup.b
 # <b><i>Don't</i> ever stop</b>
 soup.b.contents
 # [<i>Don't</i>, u' ever ', u'stop']
``clear()``
-----------
``Tag.clear()`` removes the contents of a tag::
 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)
 tag = soup.a

 tag.clear()
 tag
 # <a href="http://example.com/"></a>
``extract()``
-------------
``PageElement.extract()`` removes a tag or string from the tree. It
returns the tag or string that was extracted::
 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)
 a_tag = soup.a

 i_tag = soup.i.extract()

 a_tag
 # <a href="http://example.com/">I linked to</a>

 i_tag
 # <i>example.com</i>

 print(i_tag.parent)
 # None
At this point you effectively have two parse trees: one rooted at the
``BeautifulSoup`` object you used to parse the document, and one rooted
at the tag that was extracted. You can go on to call ``extract`` on
a child of the element you extracted::
 my_string = i_tag.string.extract()
 my_string
 # u'example.com'

 print(my_string.parent)
 # None
 i_tag
 # <i></i>
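
As a practical sketch (my example, combining ``extract()`` with the
``Comment`` class shown earlier), here's how you might strip every
comment out of a document::

 from bs4 import BeautifulSoup, Comment

 soup = BeautifulSoup("<p>One<!--a hidden note-->Two</p>")
 for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
     comment.extract()
 soup.p
 # <p>OneTwo</p>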
``decompose()``
---------------
``Tag.decompose()`` removes a tag from the tree, then `completely
destroys it and its contents`::
 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)
 a_tag = soup.a

 soup.i.decompose()

 a_tag
 # <a href="http://example.com/">I linked to</a>
.. _replace_with:
``replace_with()``
------------------
``PageElement.replace_with()`` removes a tag or string from the tree,
and replaces it with the tag or string of your choice::
 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)
 a_tag = soup.a

 new_tag = soup.new_tag("b")
 new_tag.string = "example.net"
 a_tag.i.replace_with(new_tag)

 a_tag
 # <a href="http://example.com/">I linked to <b>example.net</b></a>
``replace_with()`` returns the tag or string that was replaced, so
that you can examine it or add it back to another part of the tree.
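
For instance (a small sketch of my own), you can stash the replaced
tag and put it back somewhere else::

 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)

 old_tag = soup.a.i.replace_with("example.net")
 old_tag
 # <i>example.com</i>

 soup.a.append(old_tag)
 soup.a
 # <a href="http://example.com/">I linked to example.net<i>example.com</i></a>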
``wrap()``
----------
``PageElement.wrap()`` wraps an element in the tag you specify. It
returns the new wrapper::
soup = BeautifulSoup("
Sacr\xe9 bleu!
''' soup = BeautifulSoup(markup) print(soup.prettify()) # # # # # ## Sacré bleu! #
# # Note that the tag has been rewritten to reflect the fact that the document is now in UTF-8. If you don't want UTF-8, you can pass an encoding into ``prettify()``:: print(soup.prettify("latin-1")) # # # # ... You can also call encode() on the ``BeautifulSoup`` object, or any element in the soup, just as if it were a Python string:: soup.p.encode("latin-1") # 'Sacr\xe9 bleu!
' soup.p.encode("utf-8") # 'Sacr\xc3\xa9 bleu!
Any characters that can't be represented in your chosen encoding will
be converted into numeric XML entity references. Here's a document
that includes the Unicode character SNOWMAN::

 markup = u"<b>\N{SNOWMAN}</b>"
 snowman_soup = BeautifulSoup(markup)
 tag = snowman_soup.b

The SNOWMAN character can be part of a UTF-8 document (it looks like
☃), but there's no representation for that character in ISO-Latin-1 or
ASCII, so it's converted into "&#9731;" for those encodings::

 print(tag.encode("utf-8"))
 # <b>☃</b>

 print(tag.encode("latin-1"))
 # <b>&#9731;</b>

 print(tag.encode("ascii"))
 # <b>&#9731;</b>

Unicode, Dammit
---------------

You can use Unicode, Dammit without using Beautiful Soup. It's useful
whenever you have data in an unknown encoding and you just want it to
become Unicode::

 from bs4 import UnicodeDammit
 dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
 print(dammit.unicode_markup)
 # Sacré bleu!
 dammit.original_encoding
 # 'utf-8'

Unicode, Dammit's guesses will get a lot more accurate if you install
the ``chardet`` or ``cchardet`` Python libraries. The more data you
give Unicode, Dammit, the more accurately it will guess. If you have
your own suspicions as to what the encoding might be, you can pass
them in as a list::

 dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
 print(dammit.unicode_markup)
 # Sacré bleu!
 dammit.original_encoding
 # 'latin-1'

Unicode, Dammit has two special features that Beautiful Soup doesn't
use.

Smart quotes
^^^^^^^^^^^^

You can use Unicode, Dammit to convert Microsoft smart quotes to HTML
or XML entities::

 markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

 UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
 # u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'

 UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
 # u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'

You can also convert Microsoft smart quotes to ASCII quotes::

 UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
 # u'<p>I just "love" Microsoft Word\'s smart quotes</p>'

Hopefully you'll find this feature useful, but Beautiful Soup doesn't
use it. Beautiful Soup prefers the default behavior, which is to
convert Microsoft smart quotes to Unicode characters along with
everything else::

 UnicodeDammit(markup, ["windows-1252"]).unicode_markup
 # u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'
Inconsistent encodings
^^^^^^^^^^^^^^^^^^^^^^

Sometimes a document is mostly in UTF-8, but contains Windows-1252
characters such as (again) Microsoft smart quotes. This can happen
when a website includes data from multiple sources. You can use
``UnicodeDammit.detwingle()`` to turn such a document into pure
UTF-8. Here's a simple example::

 snowmen = (u"\N{SNOWMAN}" * 3)
 quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
 doc = snowmen.encode("utf8") + quote.encode("windows_1252")

This document is a mess. The snowmen are in UTF-8 and the quotes are
in Windows-1252. You can display the snowmen or the quotes, but not
both::

 print(doc)
 # ☃☃☃�I like snowmen!�

 print(doc.decode("windows-1252"))
 # â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”

Decoding the document as UTF-8 raises a ``UnicodeDecodeError``, and
decoding it as Windows-1252 gives you gibberish. Fortunately,
``UnicodeDammit.detwingle()`` will convert the string to pure UTF-8,
allowing you to decode it to Unicode and display the snowmen and quote
marks simultaneously::

 new_doc = UnicodeDammit.detwingle(doc)
 print(new_doc.decode("utf8"))
 # ☃☃☃“I like snowmen!”

``UnicodeDammit.detwingle()`` only knows how to handle Windows-1252
embedded in UTF-8 (or vice versa, I suppose), but this is the most
common case.

Note that you must know to call ``UnicodeDammit.detwingle()`` on your
data before passing it into ``BeautifulSoup`` or the ``UnicodeDammit``
constructor. Beautiful Soup assumes that a document has a single
encoding, whatever it might be. If you pass it a document that
contains both UTF-8 and Windows-1252, it's likely to think the whole
document is Windows-1252, and the document will come out looking like
``â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”``.

``UnicodeDammit.detwingle()`` is new in Beautiful Soup 4.1.0.

Parsing only part of a document
===============================

Let's say you want to use Beautiful Soup to look at a document's <a>
tags. It's a waste of time and memory to parse the entire document and
then go over it again looking for <a> tags. It would be much faster to
ignore everything that wasn't an <a> tag in the first place. The
``SoupStrainer`` class allows you to choose which parts of an incoming
document are parsed. You just create a ``SoupStrainer`` and pass it in
to the ``BeautifulSoup`` constructor as the ``parse_only`` argument.

(Note that *this feature won't work if you're using the html5lib
parser*. If you use html5lib, the whole document will be parsed, no
matter what. This is because html5lib constantly rearranges the parse
tree as it works, and if some part of the document didn't actually
make it into the parse tree, it'll crash. To avoid confusion, in the
examples below I'll be forcing Beautiful Soup to use Python's built-in
parser.)

``SoupStrainer``
----------------

The ``SoupStrainer`` class takes the same arguments as a typical
method from `Searching the tree`_: :ref:`name <name>`, :ref:`attrs
<attrs>`, :ref:`text <text>`, and :ref:`**kwargs <kwargs>`. Here are
three ``SoupStrainer`` objects::

 from bs4 import SoupStrainer

 only_a_tags = SoupStrainer("a")

 only_tags_with_id_link2 = SoupStrainer(id="link2")

 def is_short_string(string):
     return len(string) < 10

 only_short_strings = SoupStrainer(text=is_short_string)

I'm going to bring back the "three sisters" document one more time,
and we'll see what the document looks like when it's parsed with these
three ``SoupStrainer`` objects::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>
 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """

 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
 # <a class="sister" href="http://example.com/elsie" id="link1">
 #  Elsie
 # </a>
 # <a class="sister" href="http://example.com/lacie" id="link2">
 #  Lacie
 # </a>
 # <a class="sister" href="http://example.com/tillie" id="link3">
 #  Tillie
 # </a>

 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
 # <a class="sister" href="http://example.com/lacie" id="link2">
 #  Lacie
 # </a>

 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
 # Elsie
 # ,
 # Lacie
 # and
 # Tillie
 # ...

You can also pass a ``SoupStrainer`` into any of the methods covered
in `Searching the tree`_. This probably isn't terribly useful, but I
thought I'd mention it::

 soup = BeautifulSoup(html_doc)
 soup.find_all(only_short_strings)
 # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
 #  u'\n\n', u'...', u'\n']

Troubleshooting
===============

.. _diagnose:

``diagnose()``
--------------

If you're having trouble understanding what Beautiful Soup does to a
document, pass the document into the ``diagnose()`` function. (New in
Beautiful Soup 4.2.0.) Beautiful Soup will print out a report showing
you how different parsers handle the document, and tell you if you're
missing a parser that Beautiful Soup could be using::

 from bs4.diagnose import diagnose
 data = open("bad.html").read()
 diagnose(data)

 # Diagnostic running on Beautiful Soup 4.2.0
 # Python version 2.7.3 (default, Aug  1 2012, 05:16:07)
 # I noticed that html5lib is not installed. Installing it may help.
 # Found lxml version 2.3.2.0
 #
 # Trying to parse your data with html.parser
 # Here's what html.parser did with the document:
 # ...

Just looking at the output of diagnose() may show you how to solve the
problem. Even if not, you can paste the output of ``diagnose()`` when
asking for help.

Errors when parsing a document
------------------------------

There are two different kinds of parse errors. There are crashes,
where you feed a document to Beautiful Soup and it raises an
exception, usually an ``HTMLParser.HTMLParseError``. And there is
unexpected behavior, where a Beautiful Soup parse tree looks a lot
different than the document used to create it.

Almost none of these problems turn out to be problems with Beautiful
Soup. This is not because Beautiful Soup is an amazingly well-written
piece of software. It's because Beautiful Soup doesn't include any
parsing code. Instead, it relies on external parsers. If one parser
isn't working on a certain document, the best solution is to try a
different parser. See `Installing a parser`_ for details and a parser
comparison.

The most common parse errors are ``HTMLParser.HTMLParseError:
malformed start tag`` and ``HTMLParser.HTMLParseError: bad end
tag``. These are both generated by Python's built-in HTML parser
library, and the solution is to :ref:`install lxml or html5lib
<parser-installation>`.