Beautiful Soup Documentation
============================

.. image:: 6.1.jpg
   :align: right
   :alt: "The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."

`Beautiful Soup <http://www.crummy.com/software/BeautifulSoup/>`_ is a
Python library for pulling data out of HTML and XML files. It works
with your favorite parser to provide idiomatic ways of navigating,
searching, and modifying the parse tree. It commonly saves programmers
hours or days of work.

These instructions illustrate all major features of Beautiful Soup 4,
with examples. I show you what the library is good for, how it works,
how to use it, how to make it do what you want, and what to do when it
violates your expectations.

The examples in this documentation should work the same way in Python
2.7 and Python 3.2.

You might be looking for the documentation for `Beautiful Soup 3
<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_.
If so, you should know that Beautiful Soup 3 is no longer being
developed, and that Beautiful Soup 4 is recommended for all new
projects. If you want to learn about the differences between Beautiful
Soup 3 and Beautiful Soup 4, see `Porting code to BS4`_.

This documentation has been translated into other languages by its users.

* This document is also available in a Korean translation. (`external link <http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>`_)

Getting help
------------

If you have questions about Beautiful Soup, or run into problems,
`send mail to the discussion group
<https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup>`_. If
your problem involves parsing an HTML document, be sure to mention
:ref:`what the diagnose() function says <diagnose>` about
that document.

Quick Start
===========

Here's an HTML document I'll be using as an example throughout this
document.
It's part of a story from `Alice in Wonderland`::

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

Running the "three sisters" document through Beautiful Soup gives us a
``BeautifulSoup`` object, which represents the document as a nested
data structure::

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc)

    print(soup.prettify())
    # <html>
    #  <head>
    #   <title>
    #    The Dormouse's story
    #   </title>
    #  </head>
    #  <body>
    #   <p class="title">
    #    <b>
    #     The Dormouse's story
    #    </b>
    #   </p>
    #   <p class="story">
    #    Once upon a time there were three little sisters; and their names were
    #    <a class="sister" href="http://example.com/elsie" id="link1">
    #     Elsie
    #    </a>
    #    ,
    #    <a class="sister" href="http://example.com/lacie" id="link2">
    #     Lacie
    #    </a>
    #    and
    #    <a class="sister" href="http://example.com/tillie" id="link3">
    #     Tillie
    #    </a>
    #    ; and they lived at the bottom of a well.
    #   </p>
    #   <p class="story">
    #    ...
    #   </p>
    #  </body>
    # </html>

Here are some simple ways to navigate that data structure::

    soup.title
    # <title>The Dormouse's story</title>

    soup.title.name
    # u'title'

    soup.title.string
    # u'The Dormouse's story'

    soup.title.parent.name
    # u'head'

    soup.p
    # <p class="title"><b>The Dormouse's story</b></p>

    soup.p['class']
    # u'title'

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.find(id="link3")
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page's <a> tags::

    for link in soup.find_all('a'):
        print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie

Another common task is extracting all the text from a page::

    print(soup.get_text())
    # The Dormouse's story
    #
    # The Dormouse's story
    #
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    #
    # ...

Does this look like what you need? If so, read on.

Installing Beautiful Soup
=========================

If you're using a recent version of Debian or Ubuntu Linux, you can
install Beautiful Soup with the system package manager:

:kbd:`$ apt-get install python-bs4`

Beautiful Soup 4 is published through PyPI, so if you can't install it
with the system packager, you can install it with ``easy_install`` or
``pip``.
The package name is ``beautifulsoup4``, and the same package
works on Python 2 and Python 3.

:kbd:`$ easy_install beautifulsoup4`

:kbd:`$ pip install beautifulsoup4`

(The ``BeautifulSoup`` package is probably `not` what you want. That's
the previous major release, `Beautiful Soup 3`_. Lots of software uses
BS3, so it's still available, but if you're writing new code you
should install ``beautifulsoup4``.)

If you don't have ``easy_install`` or ``pip`` installed, you can
`download the Beautiful Soup 4 source tarball
<http://www.crummy.com/software/BeautifulSoup/download/4.x/>`_ and
install it with ``setup.py``.

:kbd:`$ python setup.py install`

If all else fails, the license for Beautiful Soup allows you to
package the entire library with your application. You can download the
tarball, copy its ``bs4`` directory into your application's codebase,
and use Beautiful Soup without installing it at all.

I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it
should work with other recent versions.

Problems after installation
---------------------------

Beautiful Soup is packaged as Python 2 code. When you install it for
use with Python 3, it's automatically converted to Python 3 code. If
you don't install the package, the code won't be converted. There have
also been reports on Windows machines of the wrong version being
installed.

If you get the ``ImportError`` "No module named HTMLParser", your
problem is that you're running the Python 2 version of the code under
Python 3.

If you get the ``ImportError`` "No module named html.parser", your
problem is that you're running the Python 3 version of the code under
Python 2.

In both cases, your best bet is to completely remove the Beautiful
Soup installation from your system (including any directory created
when you unzipped the tarball) and try the installation again.

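
If you suspect the wrong copy of the library is being picked up, one quick
sanity check (my own sketch, not an official troubleshooting step) is to ask
Python where it's importing ``bs4`` from::

```python
# Print where the bs4 package was imported from, and which version it is.
# A path pointing at a stray unzipped tarball (rather than site-packages)
# is a clue that a leftover copy is shadowing the installed package.
import bs4

print(bs4.__file__)     # location of the package actually being imported
print(bs4.__version__)  # the installed version string
```
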
If you get the ``SyntaxError`` "Invalid syntax" on the line
``ROOT_TAG_NAME = u'[document]'``, you need to convert the Python 2
code to Python 3. You can do this either by installing the package:

:kbd:`$ python3 setup.py install`

or by manually running Python's ``2to3`` conversion script on the
``bs4`` directory:

:kbd:`$ 2to3-3.2 -w bs4`

.. _parser-installation:


Installing a parser
-------------------

Beautiful Soup supports the HTML parser included in Python's standard
library, but it also supports a number of third-party Python parsers.
One is the `lxml parser <http://lxml.de/>`_. Depending on your setup,
you might install lxml with one of these commands:

:kbd:`$ apt-get install python-lxml`

:kbd:`$ easy_install lxml`

:kbd:`$ pip install lxml`

Another alternative is the pure-Python `html5lib parser
<http://code.google.com/p/html5lib/>`_, which parses HTML the way a
web browser does. Depending on your setup, you might install html5lib
with one of these commands:

:kbd:`$ apt-get install python-html5lib`

:kbd:`$ easy_install html5lib`

:kbd:`$ pip install html5lib`

This table summarizes the advantages and disadvantages of each parser library:

+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| Parser               | Typical usage                              | Advantages                     | Disadvantages            |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| Python's html.parser | ``BeautifulSoup(markup, "html.parser")``   | * Batteries included           | * Not very lenient       |
|                      |                                            | * Decent speed                 |   (before Python 2.7.3   |
|                      |                                            | * Lenient (as of Python 2.7.3  |   or 3.2.2)              |
|                      |                                            |   and 3.2.2)                   |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| lxml's HTML parser   | ``BeautifulSoup(markup, "lxml")``          | * Very fast                    | * External C dependency  |
|                      |                                            | * Lenient                      |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| lxml's XML parser    | ``BeautifulSoup(markup, ["lxml", "xml"])`` | * Very fast                    | * External C dependency  |
|                      | ``BeautifulSoup(markup, "xml")``           | * The only currently supported |                          |
|                      |                                            |   XML parser                   |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| html5lib             | ``BeautifulSoup(markup, "html5lib")``      | * Extremely lenient            | * Very slow              |
|                      |                                            | * Parses pages the same way a  | * External Python        |
|                      |                                            |   web browser does             |   dependency             |
|                      |                                            | * Creates valid HTML5          |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+

If you can, I recommend you install and use lxml for speed. If you're
using a version of Python 2 earlier than 2.7.3, or a version of Python
3 earlier than 3.2.2, it's `essential` that you install lxml or
html5lib--Python's built-in HTML parser is just not very good in older
versions.

Note that if a document is invalid, different parsers will generate
different Beautiful Soup trees for it. See `Differences
between parsers`_ for details.

Making the soup
===============

To parse a document, pass it into the ``BeautifulSoup``
constructor.
You can pass in a string or an open filehandle::

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open("index.html"))

    soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are
converted to Unicode characters::

    BeautifulSoup("Sacr&eacute; bleu!")
    <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available
parser. It will use an HTML parser unless you specifically tell it to
use an XML parser. (See `Parsing XML`_.)

Kinds of objects
================

Beautiful Soup transforms a complex HTML document into a complex tree
of Python objects. But you'll only ever have to deal with about four
`kinds` of objects.

.. _Tag:

``Tag``
-------

A ``Tag`` object corresponds to an XML or HTML tag in the original document::

    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
    tag = soup.b
    type(tag)
    # <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I'll cover most of them
in `Navigating the tree`_ and `Searching the tree`_. For now, the most
important features of a tag are its name and attributes.

Name
^^^^

Every tag has a name, accessible as ``.name``::

    tag.name
    # u'b'

If you change a tag's name, the change will be reflected in any HTML
markup generated by Beautiful Soup::

    tag.name = "blockquote"
    tag
    # <blockquote class="boldest">Extremely bold</blockquote>

Attributes
^^^^^^^^^^

A tag may have any number of attributes. The tag ``<b
class="boldest">`` has an attribute "class" whose value is
"boldest".
You can access a tag's attributes by treating the tag like
a dictionary::

    tag['class']
    # u'boldest'

You can access that dictionary directly as ``.attrs``::

    tag.attrs
    # {u'class': u'boldest'}

You can add, remove, and modify a tag's attributes. Again, this is
done by treating the tag as a dictionary::

    tag['class'] = 'verybold'
    tag['id'] = 1
    tag
    # <blockquote class="verybold" id="1">Extremely bold</blockquote>

    del tag['class']
    del tag['id']
    tag
    # <blockquote>Extremely bold</blockquote>

    tag['class']
    # KeyError: 'class'
    print(tag.get('class'))
    # None

.. _multivalue:

Multi-valued attributes
&&&&&&&&&&&&&&&&&&&&&&&

HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is ``class`` (that is, a tag can have more than
one CSS class). Others include ``rel``, ``rev``, ``accept-charset``,
``headers``, and ``accesskey``.
Beautiful Soup presents the value(s)
of a multi-valued attribute as a list::

    css_soup = BeautifulSoup('<p class="body strikeout"></p>')
    css_soup.p['class']
    # ["body", "strikeout"]

    css_soup = BeautifulSoup('<p class="body"></p>')
    css_soup.p['class']
    # ["body"]

If an attribute `looks` like it has more than one value, but it's not
a multi-valued attribute as defined by any version of the HTML
standard, Beautiful Soup will leave the attribute alone::

    id_soup = BeautifulSoup('<p id="my id"></p>')
    id_soup.p['id']
    # 'my id'

When you turn a tag back into a string, multiple attribute values are
consolidated::

    rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
    rel_soup.a['rel']
    # ['index']
    rel_soup.a['rel'] = ['index', 'contents']
    print(rel_soup.p)
    # <p>Back to the <a rel="index contents">homepage</a></p>

If you parse a document as XML, there are no multi-valued attributes::

    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
    xml_soup.p['class']
    # u'body strikeout'


``NavigableString``
-------------------

A string corresponds to a bit of text within a tag. Beautiful Soup
uses the ``NavigableString`` class to contain these bits of text::

    tag.string
    # u'Extremely bold'
    type(tag.string)
    # <class 'bs4.element.NavigableString'>

A ``NavigableString`` is just like a Python Unicode string, except
that it also supports some of the features described in `Navigating
the tree`_ and `Searching the tree`_.
You can convert a
``NavigableString`` to a Unicode string with ``unicode()``::

    unicode_string = unicode(tag.string)
    unicode_string
    # u'Extremely bold'
    type(unicode_string)
    # <type 'unicode'>

You can't edit a string in place, but you can replace one string with
another, using :ref:`replace_with`::

    tag.string.replace_with("No longer bold")
    tag
    # <blockquote>No longer bold</blockquote>

``NavigableString`` supports most of the features described in
`Navigating the tree`_ and `Searching the tree`_, but not all of
them. In particular, since a string can't contain anything (the way a
tag may contain a string or another tag), strings don't support the
``.contents`` or ``.string`` attributes, or the ``find()`` method.

If you want to use a ``NavigableString`` outside of Beautiful Soup,
you should call ``unicode()`` on it to turn it into a normal Python
Unicode string. If you don't, your string will carry around a
reference to the entire Beautiful Soup parse tree, even when you're
done using Beautiful Soup. This is a big waste of memory.

``BeautifulSoup``
-----------------

The ``BeautifulSoup`` object itself represents the document as a
whole. For most purposes, you can treat it as a :ref:`Tag`
object. This means it supports most of the methods described in
`Navigating the tree`_ and `Searching the tree`_.

Since the ``BeautifulSoup`` object doesn't correspond to an actual
HTML or XML tag, it has no name and no attributes. But sometimes it's
useful to look at its ``.name``, so it's been given the special
``.name`` "[document]"::

    soup.name
    # u'[document]'

Comments and other special strings
----------------------------------

``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
everything you'll see in an HTML or XML file, but there are a few
leftover bits.
The only one you'll probably ever need to worry about
is the comment::

    markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
    soup = BeautifulSoup(markup)
    comment = soup.b.string
    type(comment)
    # <class 'bs4.element.Comment'>

The ``Comment`` object is just a special type of ``NavigableString``::

    comment
    # u'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a ``Comment`` is
displayed with special formatting::

    print(soup.b.prettify())
    # <b>
    #  <!--Hey, buddy. Want to buy a used parser?-->
    # </b>

Beautiful Soup defines classes for anything else that might show up in
an XML document: ``CData``, ``ProcessingInstruction``,
``Declaration``, and ``Doctype``. Just like ``Comment``, these classes
are subclasses of ``NavigableString`` that add something extra to the
string. Here's an example that replaces the comment with a CDATA
block::

    from bs4 import CData
    cdata = CData("A CDATA block")
    comment.replace_with(cdata)

    print(soup.b.prettify())
    # <b>
    #  <![CDATA[A CDATA block]]>
    # </b>


Navigating the tree
===================

Here's the "Three sisters" HTML document again::

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>

    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc)

I'll use this as an example to show you how to move from one part of
a document to
another.

Going down
----------

Tags may contain strings and other tags. These elements are the tag's
`children`. Beautiful Soup provides a lot of different attributes for
navigating and iterating over a tag's children.

Note that Beautiful Soup strings don't support any of these
attributes, because a string can't have children.

Navigating using tag names
^^^^^^^^^^^^^^^^^^^^^^^^^^

The simplest way to navigate the parse tree is to say the name of the
tag you want. If you want the <head> tag, just say ``soup.head``::

    soup.head
    # <head><title>The Dormouse's story</title></head>

    soup.title
    # <title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part
of the parse tree. This code gets the first <b> tag beneath the <body> tag::

    soup.body.b
    # <b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the `first` tag by that
name::

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get `all` the <a> tags, or anything more complicated
than the first tag with a certain name, you'll need to use one of the
methods described in `Searching the tree`_, such as `find_all()`::

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

``.contents`` and ``.children``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A tag's children are available in a list called ``.contents``::

    head_tag = soup.head
    head_tag
    # <head><title>The Dormouse's story</title></head>

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    title_tag = head_tag.contents[0]
    title_tag
    # <title>The Dormouse's story</title>

    title_tag.contents
    # [u'The Dormouse's story']

The ``BeautifulSoup`` object itself has children. In this case, the
<html> tag is the child of the ``BeautifulSoup`` object::

    len(soup.contents)
    # 1
    soup.contents[0].name
    # u'html'

A string does not have ``.contents``, because it can't contain
anything::

    text = title_tag.contents[0]
    text.contents
    # AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, you can iterate over a tag's
children using the ``.children`` generator::

    for child in title_tag.children:
        print(child)
    # The Dormouse's story

``.descendants``
^^^^^^^^^^^^^^^^

The ``.contents`` and ``.children`` attributes only consider a tag's
`direct` children. For instance, the <head> tag has a single direct
child--the <title> tag::

    head_tag.contents
    # [<title>The Dormouse's story</title>]

But the <title> tag itself has a child: the string "The Dormouse's
story". There's a sense in which that string is also a child of the
<head> tag. The ``.descendants`` attribute lets you iterate over `all`
of a tag's children, recursively: its direct children, the children of
its direct children, and so on::

    for child in head_tag.descendants:
        print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story

The <head> tag has only one child, but it has two descendants: the
<title> tag and the <title> tag's child. The ``BeautifulSoup`` object
only has one direct child (the <html> tag), but it has a whole lot of
descendants::

    len(list(soup.children))
    # 1
    len(list(soup.descendants))
    # 25

.. _.string:

``.string``
^^^^^^^^^^^

If a tag has only one child, and that child is a ``NavigableString``,
the child is made available as ``.string``::

    title_tag.string
    # u'The Dormouse's story'

If a tag's only child is another tag, and `that` tag has a
``.string``, then the parent tag is considered to have the same
``.string`` as its child::

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    head_tag.string
    # u'The Dormouse's story'

If a tag contains more than one thing, then it's not clear what
``.string`` should refer to, so ``.string`` is defined to be
``None``::

    print(soup.html.string)
    # None

.. _string-generators:

``.strings`` and ``.stripped_strings``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If there's more than one thing inside a tag, you can still look at
just the strings. Use the ``.strings`` generator::

    for string in soup.strings:
        print(repr(string))
    # u"The Dormouse's story"
    # u'\n\n'
    # u"The Dormouse's story"
    # u'\n\n'
    # u'Once upon a time there were three little sisters; and their names were\n'
    # u'Elsie'
    # u',\n'
    # u'Lacie'
    # u' and\n'
    # u'Tillie'
    # u';\nand they lived at the bottom of a well.'
    # u'\n\n'
    # u'...'
    # u'\n'

These strings tend to have a lot of extra whitespace, which you can
remove by using the ``.stripped_strings`` generator instead::

    for string in soup.stripped_strings:
        print(repr(string))
    # u"The Dormouse's story"
    # u"The Dormouse's story"
    # u'Once upon a time there were three little sisters; and their names were'
    # u'Elsie'
    # u','
    # u'Lacie'
    # u'and'
    # u'Tillie'
    # u';\nand they lived at the bottom of a well.'
    # u'...'

Here, strings consisting entirely of whitespace are ignored, and
whitespace at the beginning and end of strings is removed.

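
Because ``.stripped_strings`` yields plain strings, it combines naturally
with ``join()`` when you want all of a tag's text as a single
whitespace-normalized string. Here's a small sketch of that pattern (my own
example, not one from the "three sisters" walkthrough):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p>Once upon a time there were <b>three</b> little sisters</p>',
    'html.parser')

# Join the stripped strings with single spaces to get normalized text,
# with the extra whitespace around each string fragment removed.
text = " ".join(soup.stripped_strings)
print(text)
# Once upon a time there were three little sisters
```
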
Going up
--------

Continuing the "family tree" analogy, every tag and every string has a
`parent`: the tag that contains it.

.. _.parent:

``.parent``
^^^^^^^^^^^

You can access an element's parent with the ``.parent`` attribute. In
the example "three sisters" document, the <head> tag is the parent
of the <title> tag::

    title_tag = soup.title
    title_tag
    # <title>The Dormouse's story</title>
    title_tag.parent
    # <head><title>The Dormouse's story</title></head>

The title string itself has a parent: the <title> tag that contains
it::

    title_tag.string.parent
    # <title>The Dormouse's story</title>

The parent of a top-level tag like <html> is the ``BeautifulSoup`` object
itself::

    html_tag = soup.html
    type(html_tag.parent)
    # <class 'bs4.BeautifulSoup'>

And the ``.parent`` of a ``BeautifulSoup`` object is defined as None::

    print(soup.parent)
    # None

.. _.parents:

``.parents``
^^^^^^^^^^^^

You can iterate over all of an element's parents with
``.parents``. This example uses ``.parents`` to travel from an <a> tag
buried deep within the document, to the very top of the document::

    link = soup.a
    link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    for parent in link.parents:
        if parent is None:
            print(parent)
        else:
            print(parent.name)
    # p
    # body
    # html
    # [document]
    # None

Going sideways
--------------

Consider a simple document like this::

    sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
    print(sibling_soup.prettify())
    # <html>
    #  <body>
    #   <a>
    #    <b>
    #     text1
    #    </b>
    #    <c>
    #     text2
    #    </c>
    #   </a>
    #  </body>
    # </html>

The <b> tag and the <c> tag are at the same level: they're both direct
children of the same tag. We call them `siblings`.
When a document is
pretty-printed, siblings show up at the same indentation level. You
can also use this relationship in the code you write.

``.next_sibling`` and ``.previous_sibling``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can use ``.next_sibling`` and ``.previous_sibling`` to navigate
between page elements that are on the same level of the parse tree::

    sibling_soup.b.next_sibling
    # <c>text2</c>

    sibling_soup.c.previous_sibling
    # <b>text1</b>

The <b> tag has a ``.next_sibling``, but no ``.previous_sibling``,
because there's nothing before the <b> tag `on the same level of the
tree`. For the same reason, the <c> tag has a ``.previous_sibling``
but no ``.next_sibling``::

    print(sibling_soup.b.previous_sibling)
    # None
    print(sibling_soup.c.next_sibling)
    # None

The strings "text1" and "text2" are `not` siblings, because they don't
have the same parent::

    sibling_soup.b.string
    # u'text1'

    print(sibling_soup.b.string.next_sibling)
    # None

In real documents, the ``.next_sibling`` or ``.previous_sibling`` of a
tag will usually be a string containing whitespace. Going back to the
"three sisters" document::

    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

You might think that the ``.next_sibling`` of the first <a> tag would
be the second <a> tag.
But actually, it's a string: the comma and
newline that separate the first <a> tag from the second::

    link = soup.a
    link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    link.next_sibling
    # u',\n'

The second <a> tag is actually the ``.next_sibling`` of the comma::

    link.next_sibling.next_sibling
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

.. _sibling-generators:

``.next_siblings`` and ``.previous_siblings``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can iterate over a tag's siblings with ``.next_siblings`` or
``.previous_siblings``::

    for sibling in soup.a.next_siblings:
        print(repr(sibling))
    # u',\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # u' and\n'
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    # u'; and they lived at the bottom of a well.'
    # None

    for sibling in soup.find(id="link3").previous_siblings:
        print(repr(sibling))
    # u' and\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # u',\n'
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    # u'Once upon a time there were three little sisters; and their names were\n'
    # None

Going back and forth
--------------------

Take a look at the beginning of the "three sisters" document::

    <html><head><title>The Dormouse's story</title></head>
    <p class="title"><b>The Dormouse's story</b></p>

An HTML parser takes this string of characters and turns it into a
series of events: "open an <html> tag", "open a <head> tag", "open a
<title> tag", "add a string", "close the <title> tag", "open a <p>
tag", and so on. Beautiful Soup offers tools for reconstructing the
initial parse of the document.

.. _element-generators:

``.next_element`` and ``.previous_element``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``.next_element`` attribute of a string or tag points to whatever
was parsed immediately afterwards. It might be the same as
``.next_sibling``, but it's usually drastically different.

Here's the final <a> tag in the "three sisters" document. Its
``.next_sibling`` is a string: the conclusion of the sentence that was
interrupted by the start of the <a> tag::

    last_a_tag = soup.find("a", id="link3")
    last_a_tag
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    last_a_tag.next_sibling
    # u'; and they lived at the bottom of a well.'

But the ``.next_element`` of that <a> tag, the thing that was parsed
immediately after the <a> tag, is `not` the rest of that sentence:
it's the word "Tillie"::

    last_a_tag.next_element
    # u'Tillie'

That's because in the original markup, the word "Tillie" appeared
before that semicolon. The parser encountered an <a> tag, then the
word "Tillie", then the closing </a> tag, then the semicolon and rest of
the sentence. The semicolon is on the same level as the <a> tag, but the
word "Tillie" was encountered first.

The ``.previous_element`` attribute is the exact opposite of
``.next_element``. It points to whatever element was parsed
immediately before this one::

    last_a_tag.previous_element
    # u' and\n'
    last_a_tag.previous_element.next_element
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

``.next_elements`` and ``.previous_elements``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You should get the idea by now.
You can use these iterators to move
forward or backward in the document as it was parsed::

 for element in last_a_tag.next_elements:
     print(repr(element))
 # u'Tillie'
 # u';\nand they lived at the bottom of a well.'
 # u'\n\n'
 # <p class="story">...</p>
 # u'...'
 # u'\n'
 # None

Searching the tree
==================

Beautiful Soup defines a lot of methods for searching the parse tree,
but they're all very similar. I'm going to spend a lot of time explaining
the two most popular methods: ``find()`` and ``find_all()``. The other
methods take almost exactly the same arguments, so I'll just cover
them briefly.

Once again, I'll be using the "three sisters" document as an example::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """

 from bs4 import BeautifulSoup
 soup = BeautifulSoup(html_doc)

By passing in a filter to a method like ``find_all()``, you can
zoom in on the parts of the document you're interested in.

Kinds of filters
----------------

Before talking in detail about ``find_all()`` and similar methods, I
want to show examples of different filters you can pass into these
methods. These filters show up again and again, throughout the
search API. You can use them to filter based on a tag's name,
on its attributes, on the text of a string, or on some combination of
these.

.. _a string:

A string
^^^^^^^^

The simplest filter is a string. Pass a string to a search method and
Beautiful Soup will perform a match against that exact string. This
code finds all the <b> tags in the document::

 soup.find_all('b')
 # [<b>The Dormouse's story</b>]

If you pass in a byte string, Beautiful Soup will assume the string is
encoded as UTF-8. You can avoid this by passing in a Unicode string instead.

.. _a regular expression:

A regular expression
^^^^^^^^^^^^^^^^^^^^

If you pass in a regular expression object, Beautiful Soup will filter
against that regular expression using its ``search()`` method. This code
finds all the tags whose names start with the letter "b"; in this
case, the <body> tag and the <b> tag::

 import re
 for tag in soup.find_all(re.compile("^b")):
     print(tag.name)
 # body
 # b

This code finds all the tags whose names contain the letter 't'::

 for tag in soup.find_all(re.compile("t")):
     print(tag.name)
 # html
 # title

.. _a list:

A list
^^^^^^

If you pass in a list, Beautiful Soup will allow a string match
against `any` item in that list. This code finds all the <a> tags
`and` all the <b> tags::

 soup.find_all(["a", "b"])
 # [<b>The Dormouse's story</b>,
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.. _the value True:

``True``
^^^^^^^^

The value ``True`` matches everything it can.
This code finds `all`
the tags in the document, but none of the text strings::

 for tag in soup.find_all(True):
     print(tag.name)
 # html
 # head
 # title
 # body
 # p
 # b
 # p
 # a
 # a
 # a
 # p

.. _a function:

A function
^^^^^^^^^^

If none of the other matches work for you, define a function that
takes an element as its only argument. The function should return
``True`` if the argument matches, and ``False`` otherwise.

Here's a function that returns ``True`` if a tag defines the "class"
attribute but doesn't define the "id" attribute::

 def has_class_but_no_id(tag):
     return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into ``find_all()`` and you'll pick up all the <p>
tags::

 soup.find_all(has_class_but_no_id)
 # [<p class="title"><b>The Dormouse's story</b></p>,
 # <p class="story">Once upon a time there were...</p>,
 # <p class="story">...</p>]

This function only picks up the <p> tags. It doesn't pick up the <a>
tags, because those tags define both "class" and "id". It doesn't pick
up tags like <html> and <title>, because those tags don't define
"class".

Here's a function that returns ``True`` if a tag is surrounded by
string objects::

 from bs4 import NavigableString
 def surrounded_by_strings(tag):
     return (isinstance(tag.next_element, NavigableString)
             and isinstance(tag.previous_element, NavigableString))

 for tag in soup.find_all(surrounded_by_strings):
     print(tag.name)
 # p
 # a
 # a
 # a
 # p

Now we're ready to look at the search methods in detail.
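As a self-contained recap of the function filter above, the sketch below parses a trimmed copy of the "three sisters" document and applies ``has_class_but_no_id`` (the explicit ``"html.parser"`` argument is an assumption added so the sketch behaves the same everywhere; any supported parser would do):

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

def has_class_but_no_id(tag):
    # Match tags that define "class" but not "id": only the three <p> tags.
    return tag.has_attr('class') and not tag.has_attr('id')

print([tag.name for tag in soup.find_all(has_class_but_no_id)])
# ['p', 'p', 'p']
```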
``find_all()``
--------------

Signature: find_all(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
<recursive>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

The ``find_all()`` method looks through a tag's descendants and
retrieves `all` descendants that match your filters. I gave several
examples in `Kinds of filters`_, but here are a few more::

 soup.find_all("title")
 # [<title>The Dormouse's story</title>]

 soup.find_all("p", "title")
 # [<p class="title"><b>The Dormouse's story</b></p>]

 soup.find_all("a")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.find_all(id="link2")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

 import re
 soup.find(text=re.compile("sisters"))
 # u'Once upon a time there were three little sisters; and their names were\n'

Some of these should look familiar, but others are new. What does it
mean to pass in a value for ``text``, or ``id``? Why does
``find_all("p", "title")`` find a <p> tag with the CSS class "title"?
Let's look at the arguments to ``find_all()``.

.. _name:

The ``name`` argument
^^^^^^^^^^^^^^^^^^^^^

Pass in a value for ``name`` and you'll tell Beautiful Soup to only
consider tags with certain names. Text strings will be ignored, as
will tags whose names don't match.

This is the simplest usage::

 soup.find_all("title")
 # [<title>The Dormouse's story</title>]

Recall from `Kinds of filters`_ that the value to ``name`` can be `a
string`_, `a regular expression`_, `a list`_, `a function`_, or `the value
True`_.

.. _kwargs:

The keyword arguments
^^^^^^^^^^^^^^^^^^^^^

Any argument that's not recognized will be turned into a filter on one
of a tag's attributes. If you pass in a value for an argument called ``id``,
Beautiful Soup will filter against each tag's 'id' attribute::

 soup.find_all(id='link2')
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for ``href``, Beautiful Soup will filter
against each tag's 'href' attribute::

 soup.find_all(href=re.compile("elsie"))
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

You can filter an attribute based on `a string`_, `a regular
expression`_, `a list`_, `a function`_, or `the value True`_.

This code finds all tags whose ``id`` attribute has a value,
regardless of what the value is::

 soup.find_all(id=True)
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can filter multiple attributes at once by passing in more than one
keyword argument::

 soup.find_all(href=re.compile("elsie"), id='link1')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attributes, like the data-* attributes in HTML 5, have names that
can't be used as the names of keyword arguments::

 data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
 data_soup.find_all(data-foo="value")
 # SyntaxError: keyword can't be an expression

You can use these attributes in searches by putting them into a
dictionary and passing the dictionary into ``find_all()`` as the
``attrs`` argument::

 data_soup.find_all(attrs={"data-foo": "value"})
 # [<div data-foo="value">foo!</div>]

.. _attrs:

Searching by CSS class
^^^^^^^^^^^^^^^^^^^^^^

It's very useful to search for a tag that has a certain CSS class, but
the name of the CSS attribute, "class", is a reserved word in
Python. Using ``class`` as a keyword argument will give you a syntax
error. As of Beautiful Soup 4.1.2, you can search by CSS class using
the keyword argument ``class_``::

 soup.find_all("a", class_="sister")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As with any keyword argument, you can pass ``class_`` a string, a regular
expression, a function, or ``True``::

 soup.find_all(class_=re.compile("itl"))
 # [<p class="title"><b>The Dormouse's story</b></p>]

 def has_six_characters(css_class):
     return css_class is not None and len(css_class) == 6

 soup.find_all(class_=has_six_characters)
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

:ref:`Remember <multivalue>` that a single tag can have multiple
values for its "class" attribute.
When you search for a tag that 1276matches a certain CSS class, you're matching against `any` of its CSS 1277classes:: 1278 1279 css_soup = BeautifulSoup('<p class="body strikeout"></p>') 1280 css_soup.find_all("p", class_="strikeout") 1281 # [<p class="body strikeout"></p>] 1282 1283 css_soup.find_all("p", class_="body") 1284 # [<p class="body strikeout"></p>] 1285 1286You can also search for the exact string value of the ``class`` attribute:: 1287 1288 css_soup.find_all("p", class_="body strikeout") 1289 # [<p class="body strikeout"></p>] 1290 1291But searching for variants of the string value won't work:: 1292 1293 css_soup.find_all("p", class_="strikeout body") 1294 # [] 1295 1296If you want to search for tags that match two or more CSS classes, you 1297should use a CSS selector:: 1298 1299 css_soup.select("p.strikeout.body") 1300 # [<p class="body strikeout"></p>] 1301 1302In older versions of Beautiful Soup, which don't have the ``class_`` 1303shortcut, you can use the ``attrs`` trick mentioned above. Create a 1304dictionary whose value for "class" is the string (or regular 1305expression, or whatever) you want to search for:: 1306 1307 soup.find_all("a", attrs={"class": "sister"}) 1308 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 1309 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 1310 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] 1311 1312.. _text: 1313 1314The ``text`` argument 1315^^^^^^^^^^^^^^^^^^^^^ 1316 1317With ``text`` you can search for strings instead of tags. As with 1318``name`` and the keyword arguments, you can pass in `a string`_, `a 1319regular expression`_, `a list`_, `a function`_, or `the value True`_. 
Here are some examples::

 soup.find_all(text="Elsie")
 # [u'Elsie']

 soup.find_all(text=["Tillie", "Elsie", "Lacie"])
 # [u'Elsie', u'Lacie', u'Tillie']

 soup.find_all(text=re.compile("Dormouse"))
 # [u"The Dormouse's story", u"The Dormouse's story"]

 def is_the_only_string_within_a_tag(s):
     """Return True if this string is the only child of its parent tag."""
     return (s == s.parent.string)

 soup.find_all(text=is_the_only_string_within_a_tag)
 # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']

Although ``text`` is for finding strings, you can combine it with
arguments that find tags: Beautiful Soup will find all tags whose
``.string`` matches your value for ``text``. This code finds the <a>
tags whose ``.string`` is "Elsie"::

 soup.find_all("a", text="Elsie")
 # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

.. _limit:

The ``limit`` argument
^^^^^^^^^^^^^^^^^^^^^^

``find_all()`` returns all the tags and strings that match your
filters. This can take a while if the document is large. If you don't
need `all` the results, you can pass in a number for ``limit``. This
works just like the LIMIT keyword in SQL. It tells Beautiful Soup to
stop gathering results after it's found a certain number.

There are three links in the "three sisters" document, but this code
only finds the first two::

 soup.find_all("a", limit=2)
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

.. _recursive:

The ``recursive`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^^

If you call ``mytag.find_all()``, Beautiful Soup will examine all the
descendants of ``mytag``: its children, its children's children, and
so on.
If you only want Beautiful Soup to consider direct children, 1372you can pass in ``recursive=False``. See the difference here:: 1373 1374 soup.html.find_all("title") 1375 # [<title>The Dormouse's story</title>] 1376 1377 soup.html.find_all("title", recursive=False) 1378 # [] 1379 1380Here's that part of the document:: 1381 1382 <html> 1383 <head> 1384 <title> 1385 The Dormouse's story 1386 </title> 1387 </head> 1388 ... 1389 1390The <title> tag is beneath the <html> tag, but it's not `directly` 1391beneath the <html> tag: the <head> tag is in the way. Beautiful Soup 1392finds the <title> tag when it's allowed to look at all descendants of 1393the <html> tag, but when ``recursive=False`` restricts it to the 1394<html> tag's immediate children, it finds nothing. 1395 1396Beautiful Soup offers a lot of tree-searching methods (covered below), 1397and they mostly take the same arguments as ``find_all()``: ``name``, 1398``attrs``, ``text``, ``limit``, and the keyword arguments. But the 1399``recursive`` argument is different: ``find_all()`` and ``find()`` are 1400the only methods that support it. Passing ``recursive=False`` into a 1401method like ``find_parents()`` wouldn't be very useful. 1402 1403Calling a tag is like calling ``find_all()`` 1404-------------------------------------------- 1405 1406Because ``find_all()`` is the most popular method in the Beautiful 1407Soup search API, you can use a shortcut for it. If you treat the 1408``BeautifulSoup`` object or a ``Tag`` object as though it were a 1409function, then it's the same as calling ``find_all()`` on that 1410object. 
These two lines of code are equivalent:: 1411 1412 soup.find_all("a") 1413 soup("a") 1414 1415These two lines are also equivalent:: 1416 1417 soup.title.find_all(text=True) 1418 soup.title(text=True) 1419 1420``find()`` 1421---------- 1422 1423Signature: find(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive 1424<recursive>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`) 1425 1426The ``find_all()`` method scans the entire document looking for 1427results, but sometimes you only want to find one result. If you know a 1428document only has one <body> tag, it's a waste of time to scan the 1429entire document looking for more. Rather than passing in ``limit=1`` 1430every time you call ``find_all``, you can use the ``find()`` 1431method. These two lines of code are `nearly` equivalent:: 1432 1433 soup.find_all('title', limit=1) 1434 # [<title>The Dormouse's story</title>] 1435 1436 soup.find('title') 1437 # <title>The Dormouse's story</title> 1438 1439The only difference is that ``find_all()`` returns a list containing 1440the single result, and ``find()`` just returns the result. 1441 1442If ``find_all()`` can't find anything, it returns an empty list. If 1443``find()`` can't find anything, it returns ``None``:: 1444 1445 print(soup.find("nosuchtag")) 1446 # None 1447 1448Remember the ``soup.head.title`` trick from `Navigating using tag 1449names`_? 
That trick works by repeatedly calling ``find()``::

 soup.head.title
 # <title>The Dormouse's story</title>

 soup.find("head").find("title")
 # <title>The Dormouse's story</title>

``find_parents()`` and ``find_parent()``
----------------------------------------

Signature: find_parents(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_parent(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

I spent a lot of time above covering ``find_all()`` and
``find()``. The Beautiful Soup API defines ten other methods for
searching the tree, but don't be afraid. Five of these methods are
basically the same as ``find_all()``, and the other five are basically
the same as ``find()``. The only differences are in what parts of the
tree they search.

First let's consider ``find_parents()`` and
``find_parent()``. Remember that ``find_all()`` and ``find()`` work
their way down the tree, looking at a tag's descendants. These methods
do the opposite: they work their way `up` the tree, looking at a tag's
(or a string's) parents.
Let's try them out, starting from a string
buried deep in the "three sisters" document::

 a_string = soup.find(text="Lacie")
 a_string
 # u'Lacie'

 a_string.find_parents("a")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

 a_string.find_parent("p")
 # <p class="story">Once upon a time there were three little sisters; and their names were
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 # and they lived at the bottom of a well.</p>

 a_string.find_parents("p", class_="title")
 # []

One of the three <a> tags is the direct parent of the string in
question, so our search finds it. One of the three <p> tags is an
indirect parent of the string, and our search finds that as
well. There's a <p> tag with the CSS class "title" `somewhere` in the
document, but it's not one of this string's parents, so we can't find
it with ``find_parents()``.

You may have made the connection between ``find_parent()`` and
``find_parents()``, and the `.parent`_ and `.parents`_ attributes
mentioned earlier. The connection is very strong. These search methods
actually use ``.parents`` to iterate over all the parents, and check
each one against the provided filter to see if it matches.
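That equivalence can be sketched by hand. The ``first_matching_parent`` helper below is hypothetical (it is not part of Beautiful Soup) and only handles a plain tag-name filter; the ``"html.parser"`` argument is an illustrative choice:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story"><a id="link2">Lacie</a></p>',
                     "html.parser")
a_string = soup.a.string  # the NavigableString u'Lacie'

def first_matching_parent(element, name):
    # Walk up the tree via .parents and return the first parent whose
    # tag name matches -- roughly what find_parent(name) does for a
    # simple name filter.
    for parent in element.parents:
        if parent.name == name:
            return parent
    return None

print(first_matching_parent(a_string, "p") == a_string.find_parent("p"))
# True
```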
1507 1508``find_next_siblings()`` and ``find_next_sibling()`` 1509---------------------------------------------------- 1510 1511Signature: find_next_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`) 1512 1513Signature: find_next_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`) 1514 1515These methods use :ref:`.next_siblings <sibling-generators>` to 1516iterate over the rest of an element's siblings in the tree. The 1517``find_next_siblings()`` method returns all the siblings that match, 1518and ``find_next_sibling()`` only returns the first one:: 1519 1520 first_link = soup.a 1521 first_link 1522 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> 1523 1524 first_link.find_next_siblings("a") 1525 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 1526 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] 1527 1528 first_story_paragraph = soup.find("p", "story") 1529 first_story_paragraph.find_next_sibling("p") 1530 # <p class="story">...</p> 1531 1532``find_previous_siblings()`` and ``find_previous_sibling()`` 1533------------------------------------------------------------ 1534 1535Signature: find_previous_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`) 1536 1537Signature: find_previous_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`) 1538 1539These methods use :ref:`.previous_siblings <sibling-generators>` to iterate over an element's 1540siblings that precede it in the tree. 
The ``find_previous_siblings()``
method returns all the siblings that match, and
``find_previous_sibling()`` only returns the first one::

 last_link = soup.find("a", id="link3")
 last_link
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 last_link.find_previous_siblings("a")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 first_story_paragraph = soup.find("p", "story")
 first_story_paragraph.find_previous_sibling("p")
 # <p class="title"><b>The Dormouse's story</b></p>


``find_all_next()`` and ``find_next()``
---------------------------------------

Signature: find_all_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.next_elements <element-generators>` to
iterate over whatever tags and strings come after an element in the
document. The ``find_all_next()`` method returns all matches, and
``find_next()`` only returns the first match::

 first_link = soup.a
 first_link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 first_link.find_all_next(text=True)
 # [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
 # u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']

 first_link.find_next("p")
 # <p class="story">...</p>

In the first example, the string "Elsie" showed up, even though it was
contained within the <a> tag we started from. In the second example,
the last <p> tag in the document showed up, even though it's not in
the same part of the tree as the <a> tag we started from.
For these
methods, all that matters is that an element match the filter, and
show up later in the document than the starting element.

``find_all_previous()`` and ``find_previous()``
-----------------------------------------------

Signature: find_all_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.previous_elements <element-generators>` to
iterate over the tags and strings that come before an element in the
document. The ``find_all_previous()`` method returns all matches, and
``find_previous()`` only returns the first match::

 first_link = soup.a
 first_link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 first_link.find_all_previous("p")
 # [<p class="story">Once upon a time there were three little sisters; ...</p>,
 # <p class="title"><b>The Dormouse's story</b></p>]

 first_link.find_previous("title")
 # <title>The Dormouse's story</title>

The call to ``find_all_previous("p")`` found the first paragraph in
the document (the one with class="title"), but it also found the
second paragraph, the <p> tag that contains the <a> tag we started
with. This shouldn't be too surprising: we're looking at all the tags
that show up earlier in the document than the one we started with. A
<p> tag that contains an <a> tag must have shown up before the <a>
tag it contains.

CSS selectors
-------------

Beautiful Soup supports the most commonly-used `CSS selectors
<http://www.w3.org/TR/CSS2/selector.html>`_. Just pass a string into
the ``.select()`` method of a ``Tag`` object or the ``BeautifulSoup``
object itself.
You can find tags::

 soup.select("title")
 # [<title>The Dormouse's story</title>]

 soup.select("p:nth-of-type(3)")
 # [<p class="story">...</p>]

Find tags beneath other tags::

 soup.select("body a")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select("html head title")
 # [<title>The Dormouse's story</title>]

Find tags `directly` beneath other tags::

 soup.select("head > title")
 # [<title>The Dormouse's story</title>]

 soup.select("p > a")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select("p > a:nth-of-type(2)")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

 soup.select("p > #link1")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 soup.select("body > a")
 # []

Find the siblings of tags::

 soup.select("#link1 ~ .sister")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select("#link1 + .sister")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class::

 soup.select(".sister")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select("[class~=sister]")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by ID::

 soup.select("#link1")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 soup.select("a#link2")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Test for the existence of an attribute::

 soup.select('a[href]')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value::

 soup.select('a[href="http://example.com/elsie"]')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 soup.select('a[href^="http://example.com/"]')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select('a[href$="tillie"]')
 # [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.select('a[href*=".com/el"]')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Match language codes::

 multilingual_markup = """
 <p lang="en">Hello</p>
 <p lang="en-us">Howdy, y'all</p>
 <p lang="en-gb">Pip-pip, old fruit</p>
 <p lang="fr">Bonjour mes amis</p>
 """
 multilingual_soup = BeautifulSoup(multilingual_markup)
 multilingual_soup.select('p[lang|=en]')
 # [<p lang="en">Hello</p>,
 # <p lang="en-us">Howdy, y'all</p>,
 # <p lang="en-gb">Pip-pip, old fruit</p>]

This is a
convenience for users who know the CSS selector syntax. You 1730can do all this stuff with the Beautiful Soup API. And if CSS 1731selectors are all you need, you might as well use lxml directly, 1732because it's faster. But this lets you `combine` simple CSS selectors 1733with the Beautiful Soup API. 1734 1735 1736Modifying the tree 1737================== 1738 1739Beautiful Soup's main strength is in searching the parse tree, but you 1740can also modify the tree and write your changes as a new HTML or XML 1741document. 1742 1743Changing tag names and attributes 1744--------------------------------- 1745 1746I covered this earlier, in `Attributes`_, but it bears repeating. You 1747can rename a tag, change the values of its attributes, add new 1748attributes, and delete attributes:: 1749 1750 soup = BeautifulSoup('<b class="boldest">Extremely bold</b>') 1751 tag = soup.b 1752 1753 tag.name = "blockquote" 1754 tag['class'] = 'verybold' 1755 tag['id'] = 1 1756 tag 1757 # <blockquote class="verybold" id="1">Extremely bold</blockquote> 1758 1759 del tag['class'] 1760 del tag['id'] 1761 tag 1762 # <blockquote>Extremely bold</blockquote> 1763 1764 1765Modifying ``.string`` 1766--------------------- 1767 1768If you set a tag's ``.string`` attribute, the tag's contents are 1769replaced with the string you give:: 1770 1771 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>' 1772 soup = BeautifulSoup(markup) 1773 1774 tag = soup.a 1775 tag.string = "New link text." 1776 tag 1777 # <a href="http://example.com/">New link text.</a> 1778 1779Be careful: if the tag contained other tags, they and all their 1780contents will be destroyed. 1781 1782``append()`` 1783------------ 1784 1785You can add to a tag's contents with ``Tag.append()``. 
It works just
like calling ``.append()`` on a Python list::

 soup = BeautifulSoup("<a>Foo</a>")
 soup.a.append("Bar")

 soup
 # <html><head></head><body><a>FooBar</a></body></html>
 soup.a.contents
 # [u'Foo', u'Bar']

``BeautifulSoup.new_string()`` and ``.new_tag()``
-------------------------------------------------

If you need to add a string to a document, no problem--you can pass a
Python string in to ``append()``, or you can call the factory method
``BeautifulSoup.new_string()``::

 soup = BeautifulSoup("<b></b>")
 tag = soup.b
 tag.append("Hello")
 new_string = soup.new_string(" there")
 tag.append(new_string)
 tag
 # <b>Hello there</b>
 tag.contents
 # [u'Hello', u' there']

If you want to create a comment or some other subclass of
``NavigableString``, pass that class as the second argument to
``new_string()``::

 from bs4 import Comment
 new_comment = soup.new_string("Nice to see you.", Comment)
 tag.append(new_comment)
 tag
 # <b>Hello there<!--Nice to see you.--></b>
 tag.contents
 # [u'Hello', u' there', u'Nice to see you.']

(This is a new feature in Beautiful Soup 4.2.1.)

What if you need to create a whole new tag? The best solution is to
call the factory method ``BeautifulSoup.new_tag()``::

 soup = BeautifulSoup("<b></b>")
 original_tag = soup.b

 new_tag = soup.new_tag("a", href="http://www.example.com")
 original_tag.append(new_tag)
 original_tag
 # <b><a href="http://www.example.com"></a></b>

 new_tag.string = "Link text."
 original_tag
 # <b><a href="http://www.example.com">Link text.</a></b>

Only the first argument, the tag name, is required.
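Putting ``new_tag()`` and ``append()`` together, you can build up markup from scratch. A minimal sketch (the ``<ul>``/``<li>`` fragment and the explicit ``"html.parser"`` argument are illustrative choices, not requirements):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul></ul>", "html.parser")
ul = soup.ul

# Build three <li> items and append each one to the list.
for text in ["Elsie", "Lacie", "Tillie"]:
    li = soup.new_tag("li")
    li.string = text
    ul.append(li)

print(soup)
# <ul><li>Elsie</li><li>Lacie</li><li>Tillie</li></ul>
```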

``insert()``
------------

``Tag.insert()`` is just like ``Tag.append()``, except the new element
doesn't necessarily go at the end of its parent's
``.contents``. It'll be inserted at whatever numeric position you
say. It works just like ``.insert()`` on a Python list::

  markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
  soup = BeautifulSoup(markup)
  tag = soup.a

  tag.insert(1, "but did not endorse ")
  tag
  # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
  tag.contents
  # [u'I linked to ', u'but did not endorse ', <i>example.com</i>]

``insert_before()`` and ``insert_after()``
------------------------------------------

The ``insert_before()`` method inserts a tag or string immediately
before something else in the parse tree::

  soup = BeautifulSoup("<b>stop</b>")
  tag = soup.new_tag("i")
  tag.string = "Don't"
  soup.b.string.insert_before(tag)
  soup.b
  # <b><i>Don't</i>stop</b>

The ``insert_after()`` method moves a tag or string so that it
immediately follows something else in the parse tree::

  soup.b.i.insert_after(soup.new_string(" ever "))
  soup.b
  # <b><i>Don't</i> ever stop</b>
  soup.b.contents
  # [<i>Don't</i>, u' ever ', u'stop']

``clear()``
-----------

``Tag.clear()`` removes the contents of a tag::

  markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
  soup = BeautifulSoup(markup)
  tag = soup.a

  tag.clear()
  tag
  # <a href="http://example.com/"></a>

``extract()``
-------------

``PageElement.extract()`` removes a tag or string from the tree.
It returns the tag or string that was extracted::

  markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
  soup = BeautifulSoup(markup)
  a_tag = soup.a

  i_tag = soup.i.extract()

  a_tag
  # <a href="http://example.com/">I linked to</a>

  i_tag
  # <i>example.com</i>

  print(i_tag.parent)
  # None

At this point you effectively have two parse trees: one rooted at the
``BeautifulSoup`` object you used to parse the document, and one rooted
at the tag that was extracted. You can go on to call ``extract`` on
a child of the element you extracted::

  my_string = i_tag.string.extract()
  my_string
  # u'example.com'

  print(my_string.parent)
  # None
  i_tag
  # <i></i>

``decompose()``
---------------

``Tag.decompose()`` removes a tag from the tree, then `completely
destroys it and its contents`::

  markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
  soup = BeautifulSoup(markup)
  a_tag = soup.a

  soup.i.decompose()

  a_tag
  # <a href="http://example.com/">I linked to</a>

.. _replace_with:

``replace_with()``
------------------

``PageElement.replace_with()`` removes a tag or string from the tree,
and replaces it with the tag or string of your choice::

  markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
  soup = BeautifulSoup(markup)
  a_tag = soup.a

  new_tag = soup.new_tag("b")
  new_tag.string = "example.net"
  a_tag.i.replace_with(new_tag)

  a_tag
  # <a href="http://example.com/">I linked to <b>example.net</b></a>

``replace_with()`` returns the tag or string that was replaced, so
that you can examine it or add it back to another part of the tree.
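Since ``replace_with()`` hands back the element it removed, you can capture it and reuse it elsewhere. A minimal sketch, assuming Python 3 and the built-in ``html.parser`` (the sentence and tag names are invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>I read <i>books</i> daily</p>', "html.parser")

# Build the replacement, swap it in, and keep what came out.
new_tag = soup.new_tag("b")
new_tag.string = "magazines"
old_tag = soup.i.replace_with(new_tag)

print(soup.p)
# <p>I read <b>magazines</b> daily</p>
print(old_tag)
# <i>books</i>
print(old_tag.parent)  # the removed tag is detached, so it has no parent
# None
```

The detached tag is a small parse tree of its own; you could ``append()`` it somewhere else in the document.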

``wrap()``
----------

``PageElement.wrap()`` wraps an element in the tag you specify. It
returns the new wrapper::

  soup = BeautifulSoup("<p>I wish I was bold.</p>")
  soup.p.string.wrap(soup.new_tag("b"))
  # <b>I wish I was bold.</b>

  soup.p.wrap(soup.new_tag("div"))
  # <div><p><b>I wish I was bold.</b></p></div>

This method is new in Beautiful Soup 4.0.5.

``unwrap()``
------------

``Tag.unwrap()`` is the opposite of ``wrap()``. It replaces a tag with
whatever's inside that tag. It's good for stripping out markup::

  markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
  soup = BeautifulSoup(markup)
  a_tag = soup.a

  a_tag.i.unwrap()
  a_tag
  # <a href="http://example.com/">I linked to example.com</a>

Like ``replace_with()``, ``unwrap()`` returns the tag
that was replaced.

Output
======

.. _.prettyprinting:

Pretty-printing
---------------

The ``prettify()`` method will turn a Beautiful Soup parse tree into a
nicely formatted Unicode string, with each HTML/XML tag on its own line::

  markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
  soup = BeautifulSoup(markup)
  soup.prettify()
  # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
  print(soup.prettify())
  # <html>
  #  <head>
  #  </head>
  #  <body>
  #   <a href="http://example.com/">
  #    I linked to
  #    <i>
  #     example.com
  #    </i>
  #   </a>
  #  </body>
  # </html>

You can call ``prettify()`` on the top-level ``BeautifulSoup`` object,
or on any of its ``Tag`` objects::

  print(soup.a.prettify())
  # <a href="http://example.com/">
  #  I linked to
  #  <i>
  #   example.com
  #  </i>
  # </a>

Non-pretty printing
-------------------

If you just want a string, with no fancy formatting, you can call
``unicode()`` or ``str()`` on a ``BeautifulSoup`` object, or a ``Tag``
within it::

  str(soup)
  # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

  unicode(soup.a)
  # u'<a href="http://example.com/">I linked to <i>example.com</i></a>'

The ``str()`` function returns a string encoded in UTF-8. See
`Encodings`_ for other options.

You can also call ``encode()`` to get a bytestring, and ``decode()``
to get Unicode.

.. _output_formatters:

Output formatters
-----------------

If you give Beautiful Soup a document that contains HTML entities like
"&ldquo;", they'll be converted to Unicode characters::

  soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
  unicode(soup)
  # u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'

If you then convert the document to a string, the Unicode characters
will be encoded as UTF-8. You won't get the HTML entities back::

  str(soup)
  # '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'

By default, the only characters that are escaped upon output are bare
ampersands and angle brackets.
These get turned into "&amp;", "&lt;",
and "&gt;", so that Beautiful Soup doesn't inadvertently generate
invalid HTML or XML::

  soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
  soup.p
  # <p>The law firm of Dewey, Cheatem, &amp; Howe</p>

  soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
  soup.a
  # <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>

You can change this behavior by providing a value for the
``formatter`` argument to ``prettify()``, ``encode()``, or
``decode()``. Beautiful Soup recognizes four possible values for
``formatter``.

The default is ``formatter="minimal"``. Strings will only be processed
enough to ensure that Beautiful Soup generates valid HTML/XML::

  french = "<p>Il a dit <<Sacr&eacute; bleu!>></p>"
  soup = BeautifulSoup(french)
  print(soup.prettify(formatter="minimal"))
  # <html>
  #  <body>
  #   <p>
  #    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
  #   </p>
  #  </body>
  # </html>

If you pass in ``formatter="html"``, Beautiful Soup will convert
Unicode characters to HTML entities whenever possible::

  print(soup.prettify(formatter="html"))
  # <html>
  #  <body>
  #   <p>
  #    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
  #   </p>
  #  </body>
  # </html>

If you pass in ``formatter=None``, Beautiful Soup will not modify
strings at all on output.
This is the fastest option, but it may lead
to Beautiful Soup generating invalid HTML/XML, as in these examples::

  print(soup.prettify(formatter=None))
  # <html>
  #  <body>
  #   <p>
  #    Il a dit <<Sacré bleu!>>
  #   </p>
  #  </body>
  # </html>

  link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
  print(link_soup.a.encode(formatter=None))
  # <a href="http://example.com/?foo=val1&bar=val2">A link</a>

Finally, if you pass in a function for ``formatter``, Beautiful Soup
will call that function once for every string and attribute value in
the document. You can do whatever you want in this function. Here's a
formatter that converts strings to uppercase and does absolutely
nothing else::

  def uppercase(str):
      return str.upper()

  print(soup.prettify(formatter=uppercase))
  # <html>
  #  <body>
  #   <p>
  #    IL A DIT <<SACRÉ BLEU!>>
  #   </p>
  #  </body>
  # </html>

  print(link_soup.a.prettify(formatter=uppercase))
  # <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
  #  A LINK
  # </a>

If you're writing your own function, you should know about the
``EntitySubstitution`` class in the ``bs4.dammit`` module. This class
implements Beautiful Soup's standard formatters as class methods: the
"html" formatter is ``EntitySubstitution.substitute_html``, and the
"minimal" formatter is ``EntitySubstitution.substitute_xml``. You can
use these functions to simulate ``formatter="html"`` or
``formatter="minimal"``, but then do something extra.
2172 2173Here's an example that replaces Unicode characters with HTML entities 2174whenever possible, but `also` converts all strings to uppercase:: 2175 2176 from bs4.dammit import EntitySubstitution 2177 def uppercase_and_substitute_html_entities(str): 2178 return EntitySubstitution.substitute_html(str.upper()) 2179 2180 print(soup.prettify(formatter=uppercase_and_substitute_html_entities)) 2181 # <html> 2182 # <body> 2183 # <p> 2184 # IL A DIT <<SACRÉ BLEU!>> 2185 # </p> 2186 # </body> 2187 # </html> 2188 2189One last caveat: if you create a ``CData`` object, the text inside 2190that object is always presented `exactly as it appears, with no 2191formatting`. Beautiful Soup will call the formatter method, just in 2192case you've written a custom method that counts all the strings in the 2193document or something, but it will ignore the return value:: 2194 2195 from bs4.element import CData 2196 soup = BeautifulSoup("<a></a>") 2197 soup.a.string = CData("one < three") 2198 print(soup.a.prettify(formatter="xml")) 2199 # <a> 2200 # <![CDATA[one < three]]> 2201 # </a> 2202 2203 2204``get_text()`` 2205-------------- 2206 2207If you only want the text part of a document or tag, you can use the 2208``get_text()`` method. 
It returns all the text in a document or
beneath a tag, as a single Unicode string::

  markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
  soup = BeautifulSoup(markup)

  soup.get_text()
  u'\nI linked to example.com\n'
  soup.i.get_text()
  u'example.com'

You can specify a string to be used to join the bits of text
together::

  soup.get_text("|")
  u'\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and
end of each bit of text::

  soup.get_text("|", strip=True)
  u'I linked to|example.com'

But at that point you might want to use the :ref:`.stripped_strings <string-generators>`
generator instead, and process the text yourself::

  [text for text in soup.stripped_strings]
  # [u'I linked to', u'example.com']

Specifying the parser to use
============================

If you just need to parse some HTML, you can dump the markup into the
``BeautifulSoup`` constructor, and it'll probably be fine. Beautiful
Soup will pick a parser for you and parse the data. But there are a
few additional arguments you can pass in to the constructor to change
which parser is used.

The first argument to the ``BeautifulSoup`` constructor is a string or
an open filehandle--the markup you want parsed. The second argument is
`how` you'd like the markup parsed.

If you don't specify anything, you'll get the best HTML parser that's
installed. Beautiful Soup ranks lxml's parser as being the best, then
html5lib's, then Python's built-in parser. You can override this by
specifying one of the following:

* What type of markup you want to parse. Currently supported are
  "html", "xml", and "html5".

* The name of the parser library you want to use.
Currently supported
  options are "lxml", "html5lib", and "html.parser" (Python's
  built-in HTML parser).

The section `Installing a parser`_ contrasts the supported parsers.

If you don't have an appropriate parser installed, Beautiful Soup will
ignore your request and pick a different parser. Right now, the only
supported XML parser is lxml. If you don't have lxml installed, asking
for an XML parser won't give you one, and asking for "lxml" won't work
either.

Differences between parsers
---------------------------

Beautiful Soup presents the same interface to a number of different
parsers, but each parser is different. Different parsers will create
different parse trees from the same document. The biggest differences
are between the HTML parsers and the XML parsers. Here's a short
document, parsed as HTML::

  BeautifulSoup("<a><b /></a>")
  # <html><head></head><body><a><b></b></a></body></html>

Since an empty <b /> tag is not valid HTML, the parser turns it into a
<b></b> tag pair.

Here's the same document parsed as XML (running this requires that you
have lxml installed). Note that the empty <b /> tag is left alone, and
that the document is given an XML declaration instead of being put
into an <html> tag::

  BeautifulSoup("<a><b /></a>", "xml")
  # <?xml version="1.0" encoding="utf-8"?>
  # <a><b/></a>

There are also differences between HTML parsers. If you give Beautiful
Soup a perfectly-formed HTML document, these differences won't
matter. One parser will be faster than another, but they'll all give
you a data structure that looks exactly like the original HTML
document.

But if the document is not perfectly-formed, different parsers will
give different results. Here's a short, invalid document parsed using
lxml's HTML parser.
Note that the dangling </p> tag is simply
ignored::

  BeautifulSoup("<a></p>", "lxml")
  # <html><body><a></a></body></html>

Here's the same document parsed using html5lib::

  BeautifulSoup("<a></p>", "html5lib")
  # <html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the dangling </p> tag, html5lib pairs it with an
opening <p> tag. This parser also adds an empty <head> tag to the
document.

Here's the same document parsed with Python's built-in HTML
parser::

  BeautifulSoup("<a></p>", "html.parser")
  # <a></a>

Like lxml, this parser ignores the closing </p> tag. Unlike
html5lib, this parser makes no attempt to create a well-formed HTML
document by adding a <body> tag. Unlike lxml, it doesn't even bother
to add an <html> tag.

Since the document "<a></p>" is invalid, none of these techniques is
the "correct" way to handle it. The html5lib parser uses techniques
that are part of the HTML5 standard, so it has the best claim on being
the "correct" way, but all three techniques are legitimate.

Differences between parsers can affect your script. If you're planning
on distributing your script to other people, or running it on multiple
machines, you should specify a parser in the ``BeautifulSoup``
constructor. That will reduce the chances that your users parse a
document differently from the way you parse it.

Encodings
=========

Any HTML or XML document is written in a specific encoding like ASCII
or UTF-8. But when you load that document into Beautiful Soup, you'll
discover it's been converted to Unicode::

  markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
  soup = BeautifulSoup(markup)
  soup.h1
  # <h1>Sacré bleu!</h1>
  soup.h1.string
  # u'Sacr\xe9 bleu!'

It's not magic. (That sure would be nice.)
Beautiful Soup uses a 2354sub-library called `Unicode, Dammit`_ to detect a document's encoding 2355and convert it to Unicode. The autodetected encoding is available as 2356the ``.original_encoding`` attribute of the ``BeautifulSoup`` object:: 2357 2358 soup.original_encoding 2359 'utf-8' 2360 2361Unicode, Dammit guesses correctly most of the time, but sometimes it 2362makes mistakes. Sometimes it guesses correctly, but only after a 2363byte-by-byte search of the document that takes a very long time. If 2364you happen to know a document's encoding ahead of time, you can avoid 2365mistakes and delays by passing it to the ``BeautifulSoup`` constructor 2366as ``from_encoding``. 2367 2368Here's a document written in ISO-8859-8. The document is so short that 2369Unicode, Dammit can't get a good lock on it, and misidentifies it as 2370ISO-8859-7:: 2371 2372 markup = b"<h1>\xed\xe5\xec\xf9</h1>" 2373 soup = BeautifulSoup(markup) 2374 soup.h1 2375 <h1>νεμω</h1> 2376 soup.original_encoding 2377 'ISO-8859-7' 2378 2379We can fix this by passing in the correct ``from_encoding``:: 2380 2381 soup = BeautifulSoup(markup, from_encoding="iso-8859-8") 2382 soup.h1 2383 <h1>םולש</h1> 2384 soup.original_encoding 2385 'iso8859-8' 2386 2387In rare cases (usually when a UTF-8 document contains text written in 2388a completely different encoding), the only way to get Unicode may be 2389to replace some characters with the special Unicode character 2390"REPLACEMENT CHARACTER" (U+FFFD, �). If Unicode, Dammit needs to do 2391this, it will set the ``.contains_replacement_characters`` attribute 2392to ``True`` on the ``UnicodeDammit`` or ``BeautifulSoup`` object. This 2393lets you know that the Unicode representation is not an exact 2394representation of the original--some data was lost. If a document 2395contains �, but ``.contains_replacement_characters`` is ``False``, 2396you'll know that the � was there originally (as it is in this 2397paragraph) and doesn't stand in for missing data. 
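Here's the ``from_encoding`` workflow in one runnable sketch, assuming Python 3 and the built-in ``html.parser`` (the byte string is a made-up Latin-1 snippet):

```python
from bs4 import BeautifulSoup

# "Sacré bleu!" encoded as Latin-1: a document this short is hard to
# autodetect reliably, so we tell Beautiful Soup the encoding up front.
markup = "<p>Sacr\xe9 bleu!</p>".encode("latin-1")

soup = BeautifulSoup(markup, "html.parser", from_encoding="latin-1")
print(soup.p.string)
# Sacré bleu!
print(soup.original_encoding)
```

With the hint supplied, ``original_encoding`` reports the encoding that was actually used, and the tree holds correctly decoded Unicode.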
2398 2399Output encoding 2400--------------- 2401 2402When you write out a document from Beautiful Soup, you get a UTF-8 2403document, even if the document wasn't in UTF-8 to begin with. Here's a 2404document written in the Latin-1 encoding:: 2405 2406 markup = b''' 2407 <html> 2408 <head> 2409 <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" /> 2410 </head> 2411 <body> 2412 <p>Sacr\xe9 bleu!</p> 2413 </body> 2414 </html> 2415 ''' 2416 2417 soup = BeautifulSoup(markup) 2418 print(soup.prettify()) 2419 # <html> 2420 # <head> 2421 # <meta content="text/html; charset=utf-8" http-equiv="Content-type" /> 2422 # </head> 2423 # <body> 2424 # <p> 2425 # Sacré bleu! 2426 # </p> 2427 # </body> 2428 # </html> 2429 2430Note that the <meta> tag has been rewritten to reflect the fact that 2431the document is now in UTF-8. 2432 2433If you don't want UTF-8, you can pass an encoding into ``prettify()``:: 2434 2435 print(soup.prettify("latin-1")) 2436 # <html> 2437 # <head> 2438 # <meta content="text/html; charset=latin-1" http-equiv="Content-type" /> 2439 # ... 2440 2441You can also call encode() on the ``BeautifulSoup`` object, or any 2442element in the soup, just as if it were a Python string:: 2443 2444 soup.p.encode("latin-1") 2445 # '<p>Sacr\xe9 bleu!</p>' 2446 2447 soup.p.encode("utf-8") 2448 # '<p>Sacr\xc3\xa9 bleu!</p>' 2449 2450Any characters that can't be represented in your chosen encoding will 2451be converted into numeric XML entity references. 
Here's a document
that includes the Unicode character SNOWMAN::

  markup = u"<b>\N{SNOWMAN}</b>"
  snowman_soup = BeautifulSoup(markup)
  tag = snowman_soup.b

The SNOWMAN character can be part of a UTF-8 document (it looks like
☃), but there's no representation for that character in ISO-Latin-1 or
ASCII, so it's converted into "&#9731;" for those encodings::

  print(tag.encode("utf-8"))
  # <b>☃</b>

  print(tag.encode("latin-1"))
  # <b>&#9731;</b>

  print(tag.encode("ascii"))
  # <b>&#9731;</b>

Unicode, Dammit
---------------

You can use Unicode, Dammit without using Beautiful Soup. It's useful
whenever you have data in an unknown encoding and you just want it to
become Unicode::

  from bs4 import UnicodeDammit
  dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
  print(dammit.unicode_markup)
  # Sacré bleu!
  dammit.original_encoding
  # 'utf-8'

Unicode, Dammit's guesses will get a lot more accurate if you install
the ``chardet`` or ``cchardet`` Python libraries. The more data you
give Unicode, Dammit, the more accurately it will guess. If you have
your own suspicions as to what the encoding might be, you can pass
them in as a list::

  dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
  print(dammit.unicode_markup)
  # Sacré bleu!
  dammit.original_encoding
  # 'latin-1'

Unicode, Dammit has two special features that Beautiful Soup doesn't
use.

Smart quotes
^^^^^^^^^^^^

You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML
entities::

  markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

  UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
  # u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'

  UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
  # u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'

You can also convert Microsoft smart quotes to ASCII quotes::

  UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
  # u'<p>I just "love" Microsoft Word\'s smart quotes</p>'

Hopefully you'll find this feature useful, but Beautiful Soup doesn't
use it. Beautiful Soup prefers the default behavior, which is to
convert Microsoft smart quotes to Unicode characters along with
everything else::

  UnicodeDammit(markup, ["windows-1252"]).unicode_markup
  # u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'

Inconsistent encodings
^^^^^^^^^^^^^^^^^^^^^^

Sometimes a document is mostly in UTF-8, but contains Windows-1252
characters such as (again) Microsoft smart quotes. This can happen
when a website includes data from multiple sources. You can use
``UnicodeDammit.detwingle()`` to turn such a document into pure
UTF-8. Here's a simple example::

  snowmen = (u"\N{SNOWMAN}" * 3)
  quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
  doc = snowmen.encode("utf8") + quote.encode("windows_1252")

This document is a mess. The snowmen are in UTF-8 and the quotes are
in Windows-1252.
You can display the snowmen or the quote
marks, but not both::

  print(doc)
  # ☃☃☃�I like snowmen!�

  print(doc.decode("windows-1252"))
  # â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”

Decoding the document as UTF-8 raises a ``UnicodeDecodeError``, and
decoding it as Windows-1252 gives you gibberish. Fortunately,
``UnicodeDammit.detwingle()`` will convert the string to pure UTF-8,
allowing you to decode it to Unicode and display the snowmen and quote
marks simultaneously::

  new_doc = UnicodeDammit.detwingle(doc)
  print(new_doc.decode("utf8"))
  # ☃☃☃“I like snowmen!”

``UnicodeDammit.detwingle()`` only knows how to handle Windows-1252
embedded in UTF-8 (or vice versa, I suppose), but this is the most
common case.

Note that you must know to call ``UnicodeDammit.detwingle()`` on your
data before passing it into ``BeautifulSoup`` or the ``UnicodeDammit``
constructor. Beautiful Soup assumes that a document has a single
encoding, whatever it might be. If you pass it a document that
contains both UTF-8 and Windows-1252, it's likely to think the whole
document is Windows-1252, and the document will come out looking like
``â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”``.

``UnicodeDammit.detwingle()`` is new in Beautiful Soup 4.1.0.

Parsing only part of a document
===============================

Let's say you want to use Beautiful Soup to look at a document's <a>
tags. It's a waste of time and memory to parse the entire document and
then go over it again looking for <a> tags. It would be much faster to
ignore everything that wasn't an <a> tag in the first place. The
``SoupStrainer`` class allows you to choose which parts of an incoming
document are parsed. You just create a ``SoupStrainer`` and pass it in
to the ``BeautifulSoup`` constructor as the ``parse_only`` argument.

(Note that *this feature won't work if you're using the html5lib parser*.
2586If you use html5lib, the whole document will be parsed, no 2587matter what. This is because html5lib constantly rearranges the parse 2588tree as it works, and if some part of the document didn't actually 2589make it into the parse tree, it'll crash. To avoid confusion, in the 2590examples below I'll be forcing Beautiful Soup to use Python's 2591built-in parser.) 2592 2593``SoupStrainer`` 2594---------------- 2595 2596The ``SoupStrainer`` class takes the same arguments as a typical 2597method from `Searching the tree`_: :ref:`name <name>`, :ref:`attrs 2598<attrs>`, :ref:`text <text>`, and :ref:`**kwargs <kwargs>`. Here are 2599three ``SoupStrainer`` objects:: 2600 2601 from bs4 import SoupStrainer 2602 2603 only_a_tags = SoupStrainer("a") 2604 2605 only_tags_with_id_link2 = SoupStrainer(id="link2") 2606 2607 def is_short_string(string): 2608 return len(string) < 10 2609 2610 only_short_strings = SoupStrainer(text=is_short_string) 2611 2612I'm going to bring back the "three sisters" document one more time, 2613and we'll see what the document looks like when it's parsed with these 2614three ``SoupStrainer`` objects:: 2615 2616 html_doc = """ 2617 <html><head><title>The Dormouse's story</title></head> 2618 2619 <p class="title"><b>The Dormouse's story</b></p> 2620 2621 <p class="story">Once upon a time there were three little sisters; and their names were 2622 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, 2623 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 2624 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 2625 and they lived at the bottom of a well.</p> 2626 2627 <p class="story">...</p> 2628 """ 2629 2630 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify()) 2631 # <a class="sister" href="http://example.com/elsie" id="link1"> 2632 # Elsie 2633 # </a> 2634 # <a class="sister" href="http://example.com/lacie" id="link2"> 2635 # Lacie 2636 # </a> 2637 # <a 
class="sister" href="http://example.com/tillie" id="link3"> 2638 # Tillie 2639 # </a> 2640 2641 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify()) 2642 # <a class="sister" href="http://example.com/lacie" id="link2"> 2643 # Lacie 2644 # </a> 2645 2646 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify()) 2647 # Elsie 2648 # , 2649 # Lacie 2650 # and 2651 # Tillie 2652 # ... 2653 # 2654 2655You can also pass a ``SoupStrainer`` into any of the methods covered 2656in `Searching the tree`_. This probably isn't terribly useful, but I 2657thought I'd mention it:: 2658 2659 soup = BeautifulSoup(html_doc) 2660 soup.find_all(only_short_strings) 2661 # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie', 2662 # u'\n\n', u'...', u'\n'] 2663 2664Troubleshooting 2665=============== 2666 2667.. _diagnose: 2668 2669``diagnose()`` 2670-------------- 2671 2672If you're having trouble understanding what Beautiful Soup does to a 2673document, pass the document into the ``diagnose()`` function. (New in 2674Beautiful Soup 4.2.0.) Beautiful Soup will print out a report showing 2675you how different parsers handle the document, and tell you if you're 2676missing a parser that Beautiful Soup could be using:: 2677 2678 from bs4.diagnose import diagnose 2679 data = open("bad.html").read() 2680 diagnose(data) 2681 2682 # Diagnostic running on Beautiful Soup 4.2.0 2683 # Python version 2.7.3 (default, Aug 1 2012, 05:16:07) 2684 # I noticed that html5lib is not installed. Installing it may help. 2685 # Found lxml version 2.3.2.0 2686 # 2687 # Trying to parse your data with html.parser 2688 # Here's what html.parser did with the document: 2689 # ... 2690 2691Just looking at the output of diagnose() may show you how to solve the 2692problem. Even if not, you can paste the output of ``diagnose()`` when 2693asking for help. 
2694 2695Errors when parsing a document 2696------------------------------ 2697 2698There are two different kinds of parse errors. There are crashes, 2699where you feed a document to Beautiful Soup and it raises an 2700exception, usually an ``HTMLParser.HTMLParseError``. And there is 2701unexpected behavior, where a Beautiful Soup parse tree looks a lot 2702different than the document used to create it. 2703 2704Almost none of these problems turn out to be problems with Beautiful 2705Soup. This is not because Beautiful Soup is an amazingly well-written 2706piece of software. It's because Beautiful Soup doesn't include any 2707parsing code. Instead, it relies on external parsers. If one parser 2708isn't working on a certain document, the best solution is to try a 2709different parser. See `Installing a parser`_ for details and a parser 2710comparison. 2711 2712The most common parse errors are ``HTMLParser.HTMLParseError: 2713malformed start tag`` and ``HTMLParser.HTMLParseError: bad end 2714tag``. These are both generated by Python's built-in HTML parser 2715library, and the solution is to :ref:`install lxml or 2716html5lib. <parser-installation>` 2717 2718The most common type of unexpected behavior is that you can't find a 2719tag that you know is in the document. You saw it going in, but 2720``find_all()`` returns ``[]`` or ``find()`` returns ``None``. This is 2721another common problem with Python's built-in HTML parser, which 2722sometimes skips tags it doesn't understand. Again, the solution is to 2723:ref:`install lxml or html5lib. <parser-installation>` 2724 2725Version mismatch problems 2726------------------------- 2727 2728* ``SyntaxError: Invalid syntax`` (on the line ``ROOT_TAG_NAME = 2729 u'[document]'``): Caused by running the Python 2 version of 2730 Beautiful Soup under Python 3, without converting the code. 2731 2732* ``ImportError: No module named HTMLParser`` - Caused by running the 2733 Python 2 version of Beautiful Soup under Python 3. 
2734 2735* ``ImportError: No module named html.parser`` - Caused by running the 2736 Python 3 version of Beautiful Soup under Python 2. 2737 2738* ``ImportError: No module named BeautifulSoup`` - Caused by running 2739 Beautiful Soup 3 code on a system that doesn't have BS3 2740 installed. Or, by writing Beautiful Soup 4 code without knowing that 2741 the package name has changed to ``bs4``. 2742 2743* ``ImportError: No module named bs4`` - Caused by running Beautiful 2744 Soup 4 code on a system that doesn't have BS4 installed. 2745 2746.. _parsing-xml: 2747 2748Parsing XML 2749----------- 2750 2751By default, Beautiful Soup parses documents as HTML. To parse a 2752document as XML, pass in "xml" as the second argument to the 2753``BeautifulSoup`` constructor:: 2754 2755 soup = BeautifulSoup(markup, "xml") 2756 2757You'll need to :ref:`have lxml installed <parser-installation>`. 2758 2759Other parser problems 2760--------------------- 2761 2762* If your script works on one computer but not another, it's probably 2763 because the two computers have different parser libraries 2764 available. For example, you may have developed the script on a 2765 computer that has lxml installed, and then tried to run it on a 2766 computer that only has html5lib installed. See `Differences between 2767 parsers`_ for why this matters, and fix the problem by mentioning a 2768 specific parser library in the ``BeautifulSoup`` constructor. 2769 2770* Because `HTML tags and attributes are case-insensitive 2771 <http://www.w3.org/TR/html5/syntax.html#syntax>`_, all three HTML 2772 parsers convert tag and attribute names to lowercase. That is, the 2773 markup <TAG></TAG> is converted to <tag></tag>. If you want to 2774 preserve mixed-case or uppercase tags and attributes, you'll need to 2775 :ref:`parse the document as XML. <parsing-xml>` 2776 2777.. 

.. _misc:

Miscellaneous
-------------

* ``UnicodeEncodeError: 'charmap' codec can't encode character
  u'\xfoo' in position bar`` (or just about any other
  ``UnicodeEncodeError``) - This is not a problem with Beautiful Soup.
  This problem shows up in two main situations. First, when you try to
  print a Unicode character that your console doesn't know how to
  display. (See `this page on the Python wiki
  <http://wiki.python.org/moin/PrintFails>`_ for help.) Second, when
  you're writing to a file and you pass in a Unicode character that's
  not supported by your default encoding. In this case, the simplest
  solution is to explicitly encode the Unicode string into UTF-8 with
  ``u.encode("utf8")``.

* ``KeyError: [attr]`` - Caused by accessing ``tag['attr']`` when the
  tag in question doesn't define the ``attr`` attribute. The most
  common errors are ``KeyError: 'href'`` and ``KeyError:
  'class'``. Use ``tag.get('attr')`` if you're not sure ``attr`` is
  defined, just as you would with a Python dictionary.

* ``AttributeError: 'ResultSet' object has no attribute 'foo'`` - This
  usually happens because you expected ``find_all()`` to return a
  single tag or string. But ``find_all()`` returns a *list* of tags
  and strings--a ``ResultSet`` object. You need to iterate over the
  list and look at the ``.foo`` of each one. Or, if you really only
  want one result, you need to use ``find()`` instead of
  ``find_all()``.

* ``AttributeError: 'NoneType' object has no attribute 'foo'`` - This
  usually happens because you called ``find()`` and then tried to
  access the ``.foo`` attribute of the result. But in your case,
  ``find()`` didn't find anything, so it returned ``None`` instead of
  returning a tag or a string. You need to figure out why your
  ``find()`` call isn't returning anything.
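Put together, the last three entries boil down to a few defensive
idioms. Here is a minimal sketch; the markup, tag names, and variable
names are invented for illustration:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/" class="link">One</a><p>No links here</p>'
soup = BeautifulSoup(markup, "html.parser")

# KeyError: tag['attr'] raises if the attribute is missing; tag.get() doesn't
p = soup.find("p")
assert p.get("href") is None        # no exception, just None

# ResultSet: find_all() returns a list-like object, so iterate over it
hrefs = [a["href"] for a in soup.find_all("a")]
assert hrefs == ["http://example.com/"]

# NoneType: find() returns None on no match; test before touching .foo
table = soup.find("table")
assert table is None
```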

Improving Performance
---------------------

Beautiful Soup will never be as fast as the parsers it sits on top
of. If response time is critical, if you're paying for computer time
by the hour, or if there's any other reason why computer time is more
valuable than programmer time, you should forget about Beautiful Soup
and work directly atop `lxml <http://lxml.de/>`_.

That said, there are things you can do to speed up Beautiful Soup. If
you're not using lxml as the underlying parser, my advice is to
:ref:`start <parser-installation>`. Beautiful Soup parses documents
significantly faster using lxml than using html.parser or html5lib.

You can speed up encoding detection significantly by installing the
`cchardet <http://pypi.python.org/pypi/cchardet/>`_ library.

`Parsing only part of a document`_ won't save you much time parsing
the document, but it can save a lot of memory, and it'll make
searching the document much faster.

Beautiful Soup 3
================

Beautiful Soup 3 is the previous release series, and is no longer
being actively developed. It's currently packaged with all major Linux
distributions:

:kbd:`$ apt-get install python-beautifulsoup`

It's also published through PyPI as ``BeautifulSoup``:

:kbd:`$ easy_install BeautifulSoup`

:kbd:`$ pip install BeautifulSoup`

You can also `download a tarball of Beautiful Soup 3.2.0
<http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz>`_.

If you ran ``easy_install beautifulsoup`` or ``easy_install
BeautifulSoup``, but your code doesn't work, you installed Beautiful
Soup 3 by mistake. You need to run ``easy_install beautifulsoup4``.

`The documentation for Beautiful Soup 3 is archived online
<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_.
If your first language is Chinese, it might be easier for you to read
`the Chinese translation of the Beautiful Soup 3 documentation
<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html>`_,
then read this document to find out about the changes made in
Beautiful Soup 4.

Porting code to BS4
-------------------

Most code written against Beautiful Soup 3 will work against Beautiful
Soup 4 with one simple change. All you should have to do is change the
package name from ``BeautifulSoup`` to ``bs4``. So this::

  from BeautifulSoup import BeautifulSoup

becomes this::

  from bs4 import BeautifulSoup

* If you get the ``ImportError`` "No module named BeautifulSoup", your
  problem is that you're trying to run Beautiful Soup 3 code, but you
  only have Beautiful Soup 4 installed.

* If you get the ``ImportError`` "No module named bs4", your problem
  is that you're trying to run Beautiful Soup 4 code, but you only
  have Beautiful Soup 3 installed.

Although BS4 is mostly backwards-compatible with BS3, most of its
methods have been deprecated and given new names for `PEP 8 compliance
<http://www.python.org/dev/peps/pep-0008/>`_. There are numerous other
renames and changes, and a few of them break backwards compatibility.

Here's what you'll need to know to convert your BS3 code and habits to BS4:

You need a parser
^^^^^^^^^^^^^^^^^

Beautiful Soup 3 used Python's ``SGMLParser``, a module that was
deprecated and removed in Python 3.0. Beautiful Soup 4 uses
``html.parser`` by default, but you can plug in lxml or html5lib and
use that instead. See `Installing a parser`_ for a comparison.

Since ``html.parser`` is not the same parser as ``SGMLParser``, it
will treat invalid markup differently. Usually the "difference" is
that ``html.parser`` crashes.
In that case, you'll need to install
another parser. But sometimes ``html.parser`` just creates a different
parse tree than ``SGMLParser`` would. If this happens, you may need to
update your BS3 scraping code to deal with the new tree.

Method names
^^^^^^^^^^^^

* ``renderContents`` -> ``encode_contents``
* ``replaceWith`` -> ``replace_with``
* ``replaceWithChildren`` -> ``unwrap``
* ``findAll`` -> ``find_all``
* ``findAllNext`` -> ``find_all_next``
* ``findAllPrevious`` -> ``find_all_previous``
* ``findNext`` -> ``find_next``
* ``findNextSibling`` -> ``find_next_sibling``
* ``findNextSiblings`` -> ``find_next_siblings``
* ``findParent`` -> ``find_parent``
* ``findParents`` -> ``find_parents``
* ``findPrevious`` -> ``find_previous``
* ``findPreviousSibling`` -> ``find_previous_sibling``
* ``findPreviousSiblings`` -> ``find_previous_siblings``
* ``nextSibling`` -> ``next_sibling``
* ``previousSibling`` -> ``previous_sibling``

Some arguments to the Beautiful Soup constructor were renamed for the
same reasons:

* ``BeautifulSoup(parseOnlyThese=...)`` -> ``BeautifulSoup(parse_only=...)``
* ``BeautifulSoup(fromEncoding=...)`` -> ``BeautifulSoup(from_encoding=...)``

I renamed one method for compatibility with Python 3:

* ``Tag.has_key()`` -> ``Tag.has_attr()``

I renamed one attribute to use more accurate terminology:

* ``Tag.isSelfClosing`` -> ``Tag.is_empty_element``

I renamed three attributes to avoid using words that have special
meaning to Python. Unlike the others, these changes are *not backwards
compatible.* If you used these attributes in BS3, your code will break
on BS4 until you change them.

* ``UnicodeDammit.unicode`` -> ``UnicodeDammit.unicode_markup``
* ``Tag.next`` -> ``Tag.next_element``
* ``Tag.previous`` -> ``Tag.previous_element``

Generators
^^^^^^^^^^

I gave the generators PEP 8-compliant names, and transformed them into
properties:

* ``childGenerator()`` -> ``children``
* ``nextGenerator()`` -> ``next_elements``
* ``nextSiblingGenerator()`` -> ``next_siblings``
* ``previousGenerator()`` -> ``previous_elements``
* ``previousSiblingGenerator()`` -> ``previous_siblings``
* ``recursiveChildGenerator()`` -> ``descendants``
* ``parentGenerator()`` -> ``parents``

So instead of this::

  for parent in tag.parentGenerator():
      ...

You can write this::

  for parent in tag.parents:
      ...

(But the old code will still work.)

Some of the generators used to yield ``None`` after they were done, and
then stop. That was a bug. Now the generators just stop.

There are two new generators, :ref:`.strings and
.stripped_strings <string-generators>`. ``.strings`` yields
NavigableString objects, and ``.stripped_strings`` yields Python
strings that have had whitespace stripped.

XML
^^^

There is no longer a ``BeautifulStoneSoup`` class for parsing XML. To
parse XML you pass in "xml" as the second argument to the
``BeautifulSoup`` constructor. For the same reason, the
``BeautifulSoup`` constructor no longer recognizes the ``isHTML``
argument.

Beautiful Soup's handling of empty-element XML tags has been
improved. Previously when you parsed XML you had to explicitly say
which tags were considered empty-element tags. The ``selfClosingTags``
argument to the constructor is no longer recognized. Instead,
Beautiful Soup considers any empty tag to be an empty-element tag. If
you add a child to an empty-element tag, it stops being an
empty-element tag.
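Here is a short sketch of the new empty-element behavior, assuming
lxml is installed (the "xml" feature needs it) and using invented tag
names:

```python
from bs4 import BeautifulSoup

# "xml" requires lxml; the tag names below are made up for illustration
soup = BeautifulSoup("<Doc><Empty/><Item>text</Item></Doc>", "xml")

# The XML parser also preserves mixed-case tag names
assert soup.find("Item").string == "text"

# An empty tag is treated as an empty-element tag...
empty = soup.find("Empty")
assert empty.is_empty_element

# ...until it's given a child
empty.append("now full")
assert not empty.is_empty_element
```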

Entities
^^^^^^^^

An incoming HTML or XML entity is always converted into the
corresponding Unicode character. Beautiful Soup 3 had a number of
overlapping ways of dealing with entities, which have been
removed. The ``BeautifulSoup`` constructor no longer recognizes the
``smartQuotesTo`` or ``convertEntities`` arguments. (`Unicode,
Dammit`_ still has ``smart_quotes_to``, but its default is now to turn
smart quotes into Unicode.) The constants ``HTML_ENTITIES``,
``XML_ENTITIES``, and ``XHTML_ENTITIES`` have been removed, since they
configure a feature (transforming some but not all entities into
Unicode characters) that no longer exists.

If you want to turn Unicode characters back into HTML entities on
output, rather than turning them into UTF-8 characters, you need to
use an :ref:`output formatter <output_formatters>`.

Miscellaneous
^^^^^^^^^^^^^

:ref:`Tag.string <.string>` now operates recursively. If tag A
contains a single tag B and nothing else, then A.string is the same as
B.string. (Previously, it was None.)

`Multi-valued attributes`_ like ``class`` have lists of strings as
their values, not strings. This may affect the way you search by CSS
class.

If you pass one of the ``find*`` methods both :ref:`text <text>` `and`
a tag-specific argument like :ref:`name <name>`, Beautiful Soup will
search for tags that match your tag-specific criteria and whose
:ref:`Tag.string <.string>` matches your value for :ref:`text
<text>`. It will `not` find the strings themselves. Previously,
Beautiful Soup ignored the tag-specific arguments and looked for
strings.

The ``BeautifulSoup`` constructor no longer recognizes the
``markupMassage`` argument. It's now the parser's responsibility to
handle markup correctly.
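Several of these changes can be seen in one small example. This is a
sketch under Python 3, with made-up markup, using the ``html.parser``
backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="menu main"><b>&ldquo;Hi&rdquo;</b></p>',
                     "html.parser")

# Incoming entities become Unicode characters
assert soup.b.string == "\u201cHi\u201d"

# class is multi-valued: its value is a list of strings, not one string
assert soup.p["class"] == ["menu", "main"]

# Tag.string recurses: <p> contains only <b>, so they share a string
assert soup.p.string == soup.b.string

# An output formatter turns Unicode back into entities on the way out
assert "&ldquo;" in soup.b.decode(formatter="html")
```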

The rarely-used alternate parser classes like
``ICantBelieveItsBeautifulSoup`` and ``BeautifulSOAP`` have been
removed. It's now the parser's decision how to handle ambiguous
markup.

The ``prettify()`` method now returns a Unicode string, not a bytestring.
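For example, under Python 3 (a sketch with made-up markup):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")

# prettify() returns str (Unicode text), not bytes
pretty = soup.prettify()
assert isinstance(pretty, str)

# Encode explicitly if you need a bytestring
assert isinstance(pretty.encode("utf8"), bytes)
```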