= 4.3.2 (20131002) =

* Fixed a bug in which short Unicode input was improperly encoded to
  ASCII when checking whether or not it was the name of a file on
  disk. [bug=1227016]

* Fixed a crash when a short input contains data not valid in
  filenames. [bug=1232604]

* Fixed a bug that caused Unicode data put into UnicodeDammit to
  return None instead of the original data. [bug=1214983]

* Combined two tests to stop a spurious test failure when tests are
  run by nosetests. [bug=1212445]

= 4.3.1 (20130815) =

* Fixed yet another problem with the html5lib tree builder, caused by
  html5lib's tendency to rearrange the tree during
  parsing. [bug=1189267]

* Fixed a bug that caused the optimized version of find_all() to
  return nothing. [bug=1212655]

= 4.3.0 (20130812) =

* Instead of converting incoming data to Unicode and feeding it to the
  lxml tree builder in chunks, Beautiful Soup now makes successive
  guesses at the encoding of the incoming data, and tells lxml to
  parse the data as that encoding. Giving lxml more control over the
  parsing process improves performance and avoids a number of bugs and
  issues with the lxml parser which had previously required elaborate
  workarounds:

  - An issue in which lxml refuses to parse Unicode strings on some
    systems. [bug=1180527]

  - A returning bug that truncated documents longer than a (very
    small) size. [bug=963880]

  - A returning bug in which extra spaces were added to a document if
    the document defined a charset other than UTF-8. [bug=972466]

  This required a major overhaul of the tree builder architecture. If
  you wrote your own tree builder and didn't tell me, you'll need to
  modify your prepare_markup() method.

* The UnicodeDammit code that makes guesses at encodings has been
  split into its own class, EncodingDetector.
  A lot of apparently redundant code has been removed from Unicode,
  Dammit, and some undocumented features have also been removed.

* Beautiful Soup will issue a warning if instead of markup you pass it
  a URL or the name of a file on disk (a common beginner's mistake).

* A number of optimizations improve the performance of the lxml tree
  builder by about 33%, the html.parser tree builder by about 20%, and
  the html5lib tree builder by about 15%.

* All find_all calls should now return a ResultSet object. Patch by
  Aaron DeVore. [bug=1194034]

= 4.2.1 (20130531) =

* The default XML formatter will now replace ampersands even if they
  appear to be part of entities. That is, "&lt;" will become
  "&amp;lt;". The old code was left over from Beautiful Soup 3, which
  didn't always turn entities into Unicode characters.

  If you really want the old behavior (maybe because you add new
  strings to the tree, those strings include entities, and you want
  the formatter to leave them alone on output), it can be found in
  EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183]

* Gave new_string() the ability to create subclasses of
  NavigableString. [bug=1181986]

* Fixed another bug by which the html5lib tree builder could create a
  disconnected tree. [bug=1182089]

* The .previous_element of a BeautifulSoup object is now always None,
  not the last element to be parsed. [bug=1182089]

* Fixed test failures when lxml is not installed. [bug=1181589]

* html5lib now supports Python 3. Fixed some Python 2-specific
  code in the html5lib test suite. [bug=1181624]

* The html.parser treebuilder can now handle numeric attributes in
  text when the hexadecimal name of the attribute starts with a
  capital X. Patch by Tim Shirley. [bug=1186242]

= 4.2.0 (20130514) =

* The Tag.select() method now supports a much wider variety of CSS
  selectors.

  - Added support for the adjacent sibling combinator (+) and the
    general sibling combinator (~). Tests by "liquider". [bug=1082144]

  - The combinators (>, +, and ~) can now combine with any supported
    selector, not just one that selects based on tag name.

  - Added limited support for the "nth-of-type" pseudo-class. Code
    by Sven Slootweg. [bug=1109952]

* The BeautifulSoup class is now aliased to "_s" and "_soup", making
  it quicker to type the import statement in an interactive session:

  from bs4 import _s
   or
  from bs4 import _soup

  The alias may change in the future, so don't use this in code you're
  going to run more than once.

* Added the 'diagnose' submodule, which includes several useful
  functions for reporting problems and doing tech support.

  - diagnose(data) tries the given markup on every installed parser,
    reporting exceptions and displaying successes. If a parser is not
    installed, diagnose() mentions this fact.

  - lxml_trace(data, html=True) runs the given markup through lxml's
    XML parser or HTML parser, and prints out the parser events as
    they happen. This helps you quickly determine whether a given
    problem occurs in lxml code or Beautiful Soup code.

  - htmlparser_trace(data) is the same thing, but for Python's
    built-in HTMLParser class.

* In an HTML document, the contents of a <script> or <style> tag will
  no longer undergo entity substitution by default. XML documents work
  the same way they did before. [bug=1085953]

* Methods like get_text() and properties like .strings now only give
  you strings that are visible in the document--no comments or
  processing instructions. [bug=1050164]

* The prettify() method now leaves the contents of <pre> tags
  alone. [bug=1095654]

* Fix a bug in the html5lib treebuilder which sometimes created
  disconnected trees.
  [bug=1039527]

* Fix a bug in the lxml treebuilder which crashed when a tag included
  an attribute from the predefined "xml:" namespace. [bug=1065617]

* Fix a bug by which keyword arguments to find_parent() were not
  being passed on. [bug=1126734]

* Stop a crash when unwisely messing with a tag that's been
  decomposed. [bug=1097699]

* Now that lxml's segfault on invalid doctype has been fixed, fixed a
  corresponding problem on the Beautiful Soup end that was previously
  invisible. [bug=984936]

* Fixed an exception when an overspecified CSS selector didn't match
  anything. Code by Stefaan Lippens. [bug=1168167]

= 4.1.3 (20120820) =

* Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious
  test failure caused by the lousy HTMLParser in those
  versions. [bug=1038503]

* Raise a more specific error (FeatureNotFound) when a requested
  parser or parser feature is not installed. Raise NotImplementedError
  instead of ValueError when the user calls insert_before() or
  insert_after() on the BeautifulSoup object itself. Patch by Aaron
  DeVore. [bug=1038301]

= 4.1.2 (20120817) =

* As per PEP-8, allow searching by CSS class using the 'class_'
  keyword argument. [bug=1037624]

* Display namespace prefixes for namespaced attribute names, instead of
  the fully-qualified names given by the lxml parser. [bug=1037597]

* Fixed a crash on encoding when an attribute name contained
  non-ASCII characters.

* When sniffing encodings, if the cchardet library is installed,
  Beautiful Soup uses it instead of chardet. cchardet is much
  faster. [bug=1020748]

* Use logging.warning() instead of warnings.warn() to notify the user
  that characters were replaced with REPLACEMENT
  CHARACTER.
  [bug=1013862]

= 4.1.1 (20120703) =

* Fixed an html5lib tree builder crash which happened when html5lib
  moved a tag with a multivalued attribute from one part of the tree
  to another. [bug=1019603]

* Correctly display closing tags with an XML namespace declared. Patch
  by Andreas Kostyrka. [bug=1019635]

* Fixed a typo that made parsing significantly slower than it should
  have been, and also waited too long to close tags with XML
  namespaces. [bug=1020268]

* get_text() now returns an empty Unicode string if there is no text,
  rather than an empty bytestring. [bug=1020387]

= 4.1.0 (20120529) =

* Added experimental support for fixing Windows-1252 characters
  embedded in UTF-8 documents. (UnicodeDammit.detwingle())

* Fixed the handling of &quot; with the built-in parser. [bug=993871]

* Comments, processing instructions, document type declarations, and
  markup declarations are now treated as preformatted strings, the way
  CData blocks are. [bug=1001025]

* Fixed a bug with the lxml treebuilder that prevented the user from
  adding attributes to a tag that didn't originally have
  attributes. [bug=1002378] Thanks to Oliver Beattie for the patch.

* Fixed some edge-case bugs having to do with inserting an element
  into a tag it's already inside, and replacing one of a tag's
  children with another. [bug=997529]

* Added the ability to search for attribute values specified in UTF-8. [bug=1003974]

  This caused a major refactoring of the search code. All the tests
  pass, but it's possible that some searches will behave differently.

= 4.0.5 (20120427) =

* Added a new method, wrap(), which wraps an element in a tag.

* Renamed replace_with_children() to unwrap(), which is easier to
  understand and also the jQuery name of the function.

* Made encoding substitution in <meta> tags completely transparent (no
  more %SOUP-ENCODING%).

* Fixed a bug in decoding data that contained a byte-order mark, such
  as data encoded in UTF-16LE. [bug=988980]

* Fixed a bug that made the HTMLParser treebuilder generate XML
  definitions ending with two question marks instead of
  one. [bug=984258]

* Upon document generation, CData objects are no longer run through
  the formatter. [bug=988905]

* The test suite now passes when lxml is not installed, whether or not
  html5lib is installed. [bug=987004]

* Print a warning on HTMLParseErrors to let people know they should
  install a better parser library.

= 4.0.4 (20120416) =

* Fixed a bug that sometimes created disconnected trees.

* Fixed a bug with the string setter that moved a string around the
  tree instead of copying it. [bug=983050]

* Attribute values are now run through the provided output formatter.
  Previously they were always run through the 'minimal' formatter. In
  the future I may make it possible to specify different formatters
  for attribute values and strings, but for now, consistent behavior
  is better than inconsistent behavior. [bug=980237]

* Added the missing renderContents method from Beautiful Soup 3. Also
  added an encode_contents() method to go along with decode_contents().

* Give a more useful error when the user tries to run the Python 2
  version of BS under Python 3.

* UnicodeDammit can now convert Microsoft smart quotes to ASCII with
  UnicodeDammit(markup, smart_quotes_to="ascii").

= 4.0.3 (20120403) =

* Fixed a typo that caused some versions of Python 3 to convert the
  Beautiful Soup codebase incorrectly.

* Got rid of the 4.0.2 workaround for HTML documents--it was
  unnecessary and the workaround was triggering a (possibly different,
  but related) bug in lxml.
  [bug=972466]

= 4.0.2 (20120326) =

* Worked around a possible bug in lxml that prevents non-tiny XML
  documents from being parsed. [bug=963880, bug=963936]

* Fixed a bug where specifying `text` while also searching for a tag
  only worked if `text` wanted an exact string match. [bug=955942]

= 4.0.1 (20120314) =

* This is the first official release of Beautiful Soup 4. There is no
  4.0.0 release, to eliminate any possibility that packaging software
  might treat "4.0.0" as being an earlier version than "4.0.0b10".

* Brought BS up to date with the latest release of soupselect, adding
  CSS selector support for direct descendant matches and multiple CSS
  class matches.

= 4.0.0b10 (20120302) =

* Added support for simple CSS selectors, taken from the soupselect project.

* Fixed a crash when using html5lib. [bug=943246]

* In HTML5-style <meta charset="foo"> tags, the value of the "charset"
  attribute is now replaced with the appropriate encoding on
  output. [bug=942714]

* Fixed a bug that caused calling a tag to sometimes call find_all()
  with the wrong arguments. [bug=944426]

* For backwards compatibility, brought back the BeautifulStoneSoup
  class as a deprecated wrapper around BeautifulSoup.

= 4.0.0b9 (20120228) =

* Fixed the string representation of DOCTYPEs that have both a public
  ID and a system ID.

* Fixed the generated XML declaration.

* Renamed Tag.nsprefix to Tag.prefix, for consistency with
  NamespacedAttribute.

* Fixed a test failure that occurred on Python 3.x when chardet was
  installed.

* Made prettify() return Unicode by default, so it will look nice on
  Python 3 when passed into print().

= 4.0.0b8 (20120224) =

* All tree builders now preserve namespace information in the
  documents they parse.
  If you use the html5lib parser or lxml's XML
  parser, you can access the namespace URL for a tag as tag.namespace.

  However, there is no special support for namespace-oriented
  searching or tree manipulation. When you search the tree, you need
  to use namespace prefixes exactly as they're used in the original
  document.

* The string representation of a DOCTYPE always ends in a newline.

* Issue a warning if the user tries to use a SoupStrainer in
  conjunction with the html5lib tree builder, which doesn't support
  them.

= 4.0.0b7 (20120223) =

* Upon decoding to string, any characters that can't be represented in
  your chosen encoding will be converted into numeric XML entity
  references.

* Issue a warning if characters were replaced with REPLACEMENT
  CHARACTER during Unicode conversion.

* Restored compatibility with Python 2.6.

* The install process no longer installs docs or auxiliary text files.

* It's now possible to deepcopy a BeautifulSoup object created with
  Python's built-in HTML parser.

* About 100 unit tests that "test" the behavior of various parsers on
  invalid markup have been removed. Legitimate changes to those
  parsers caused these tests to fail, indicating that perhaps
  Beautiful Soup should not test the behavior of foreign
  libraries.

  The problematic unit tests have been reformulated as informational
  comparisons generated by the script
  scripts/demonstrate_parser_differences.py.

  This makes Beautiful Soup compatible with html5lib version 0.95 and
  future versions of HTMLParser.

= 4.0.0b6 (20120216) =

* Multi-valued attributes like "class" always have a list of values,
  even if there's only one value in the list.

* Added a number of multi-valued attributes defined in HTML5.

* Stopped generating a space before the slash that closes an
  empty-element tag.
  This may come back if I add a special XHTML mode
  (http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty
  useless.

* Passing text along with tag-specific arguments to a find* method:

   find("a", text="Click here")

  will find tags that contain the given text as their
  .string. Previously, the tag-specific arguments were ignored and
  only strings were searched.

* Fixed a bug that caused the html5lib tree builder to build a
  partially disconnected tree. Generally cleaned up the html5lib tree
  builder.

* If you restrict a multi-valued attribute like "class" to a string
  that contains spaces, Beautiful Soup will only consider it a match
  if the values correspond to that specific string.

= 4.0.0b5 (20120209) =

* Rationalized Beautiful Soup's treatment of CSS class. A tag
  belonging to multiple CSS classes is treated as having a list of
  values for the 'class' attribute. Searching for a CSS class will
  match *any* of the CSS classes.

  This actually affects all attributes that the HTML standard defines
  as taking multiple values (class, rel, rev, archive, accept-charset,
  and headers), but 'class' is by far the most common. [bug=41034]

* If you pass anything other than a dictionary as the second argument
  to one of the find* methods, it'll assume you want to use that
  object to search against a tag's CSS classes. Previously this only
  worked if you passed in a string.

* Fixed a bug that caused a crash when you passed a dictionary as an
  attribute value (possibly because you mistyped "attrs"). [bug=842419]

* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
  like <meta charset="utf-8" />. [bug=837268]

* If Unicode, Dammit can't figure out a consistent encoding for a
  page, it will try each of its guesses again, with errors="replace"
  instead of errors="strict".
  This may mean that some data gets
  replaced with REPLACEMENT CHARACTER, but at least most of it will
  get turned into Unicode. [bug=754903]

* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
  on certain kinds of markup. [bug=838800]

* Fixed a bug that wrecked the tree if you replaced an element with an
  empty string. [bug=728697]

* Improved Unicode, Dammit's behavior when you give it Unicode to
  begin with.

= 4.0.0b4 (20120208) =

* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag().

* BeautifulSoup.new_tag() will follow the rules of whatever
  tree-builder was used to create the original BeautifulSoup object. A
  new <p> tag will look like "<p />" if the soup object was created to
  parse XML, but it will look like "<p></p>" if the soup object was
  created to parse HTML.

* We pass in strict=False to html.parser on Python 3, greatly
  improving html.parser's ability to handle bad HTML.

* We also monkeypatch a serious bug in html.parser that made
  strict=False disastrous on Python 3.2.2.

* Replaced the "substitute_html_entities" argument with the
  more general "formatter" argument.

* Bare ampersands and angle brackets are always converted to XML
  entities unless the user prevents it.

* Added PageElement.insert_before() and PageElement.insert_after(),
  which let you put an element into the parse tree with respect to
  some other element.

* Raise an exception when the user tries to do something nonsensical
  like insert a tag into itself.


= 4.0.0b3 (20120203) =

Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.
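
To give a rough feel for what "a little glue code" means, here is a
hypothetical sketch that turns the events from Python's standard
html.parser into a nested structure. It is not Beautiful Soup's actual
tree builder: TinyTreeBuilder, parse(), and the dictionary layout are
all names invented for this illustration, and the sketch ignores void
elements, comments, and doctypes.

```python
from html.parser import HTMLParser


class TinyTreeBuilder(HTMLParser):
    """Hypothetical glue: collect html.parser events into nested dicts."""

    def __init__(self):
        super().__init__()
        self.root = {"name": "[document]", "attrs": {}, "children": []}
        self._stack = [self.root]  # currently open tags, innermost last

    def handle_starttag(self, name, attrs):
        # Attach the new tag to the innermost open tag, then open it.
        node = {"name": name, "attrs": dict(attrs), "children": []}
        self._stack[-1]["children"].append(node)
        self._stack.append(node)

    def handle_endtag(self, name):
        # Close the nearest open tag with this name, if there is one.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i]["name"] == name:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Text becomes a plain string child of the innermost open tag.
        self._stack[-1]["children"].append(data)


def parse(markup):
    builder = TinyTreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


tree = parse('<p class="intro">Hello <b>world</b></p>')
```

The parser supplies the events and the glue decides what tree to build
from them. Swapping in a different parser means writing a different
set of event handlers against the same tree.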

Beautiful Soup 4.0 comes with glue code for four parsers:

 * Python's standard HTMLParser (html.parser in Python 3)
 * lxml's HTML and XML parsers
 * html5lib's HTML parser

HTMLParser is the default, but I recommend you install lxml if you
can.

For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.

=== The module name has changed ===

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

    >>> from bs4 import BeautifulSoup

=== It works with Python 3 ===

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.

=== CDATA sections are normal text, if they're understood at all. ===

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

 <svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

 <p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

=== Miscellaneous other stuff ===

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

    <?xml version="1.0" encoding="utf-8"?>
    <markup>
     ...
    </markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.


= 3.2.0 =

The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2
to make it obvious which one you should use.

= 3.1.0 =

A hybrid version that supports Python 2.4 and can be automatically
converted to run under Python 3.0. There are three
backwards-incompatible changes you should be aware of, but no new
features or deliberate behavior changes.

1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.

The effect of this is that you can't pass an encoding to .__str__
anymore. Use encode() to get a string and decode() to get Unicode, and
you'll be ready (well, readier) for Python 3.

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

  <a href="foo</a>, </a><a href="bar">baz</a>
  <a b="<a>">', '<a b="<a>"></a><a>"></a>

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact.

 <a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.


= 3.0.7a =

Added an import that makes BS work in Python 2.3.


= 3.0.7 =

Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.

Fixed a TypeError that occurred in some circumstances when a tag
contained no text.

Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.

Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.

Beautiful Soup can now parse a doctype that's scoped to an XML namespace.


= 3.0.6 =

Got rid of a very old debug line that prevented chardet from working.

Added a Tag.decompose() method that completely disconnects a tree or a
subset of a tree, breaking it up into bite-sized pieces that are
easy for the garbage collector to collect.

Tag.extract() now returns the tag that was extracted.

Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.

Fixed a Unicode conversion bug.

Fixed a bug that garbled some <meta> tags when rewriting them.


= 3.0.5 =

Soup objects can now be pickled, and copied with copy.deepcopy.

Tag.append now works properly on existing BS objects. (It wasn't
originally intended for outside use, but it can be now.) (Giles
Radford)

Passing in a nonexistent encoding will no longer crash the parser on
Python 2.4 (John Nagle).

Fixed an underlying bug in SGMLParser that thinks ASCII has 255
characters instead of 127 (John Nagle).

Entities are converted more consistently to Unicode characters.

Entity references in attribute values are now converted to Unicode
characters when appropriate.
Numeric entities are always converted,
because SGMLParser always converts them outside of attribute values.

ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to
XHTML_ENTITIES.

The regular expression for bare ampersands was too loose. In some
cases ampersands were not being escaped. (Sam Ruby?)

Non-breaking spaces and other special Unicode space characters are no
longer folded to ASCII spaces. (Robert Leftwich)

Information inside a TEXTAREA tag is now parsed literally, not as HTML
tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)

= 3.0.4 =

Fixed a bug that crashed Unicode conversion in some cases.

Fixed a bug that prevented UnicodeDammit from being used as a
general-purpose data scrubber.

Fixed some unit test failures when running against Python 2.5.

When considering whether to convert smart quotes, UnicodeDammit now
looks at the original encoding in a case-insensitive way.

= 3.0.3 (20060606) =

Beautiful Soup is now usable as a way to clean up invalid XML/HTML (be
sure to pass in an appropriate value for convertEntities, or XML/HTML
entities might stick around that aren't valid in HTML/XML). The result
may not validate, but it should be good enough to not choke a
real-world XML parser. Specifically, the output of a properly
constructed soup object should always be valid as part of an XML
document, but parts may be missing if they were missing in the
original. As always, if the input is valid XML, the output will also
be valid.

= 3.0.2 (20060602) =

Previously, Beautiful Soup correctly handled attribute values that
contained embedded quotes (sometimes by escaping), but not other kinds
of XML character. Now, it correctly handles or escapes all special XML
characters in attribute values.

I aliased methods to the 2.x names (fetch, find, findText, etc.)
for backwards compatibility purposes. Those names are deprecated and
if I ever do a 4.0 I will remove them. I will, I tell you!

Fixed a bug where the findAll method wasn't passing along any keyword
arguments.

When run from the command line, Beautiful Soup now acts as an HTML
pretty-printer, not an XML pretty-printer.

= 3.0.1 (20060530) =

Reintroduced the "fetch by CSS class" shortcut. I thought keyword
arguments would replace it, but they don't. You can't call soup('a',
class='foo') because class is a Python keyword.

If Beautiful Soup encounters a meta tag that declares the encoding,
but a SoupStrainer tells it not to parse that tag, Beautiful Soup will
no longer try to rewrite the meta tag to mention the new
encoding. Basically, this makes SoupStrainers work in real-world
applications instead of crashing the parser.

= 3.0.0 "Who would not give all else for two p" (20060528) =

This release is not backward-compatible with previous releases. If
you've got code written with a previous version of the library, go
ahead and keep using it, unless one of the features mentioned here
really makes your life easier. Since the library is self-contained,
you can include an old copy of the library in your old applications,
and use the new version for everything else.

The documentation has been rewritten and greatly expanded with many
more examples.

Beautiful Soup autodetects the encoding of a document (or uses the one
you specify), and converts it from its native encoding to
Unicode. Internally, it only deals with Unicode strings. When you
print out the document, it converts to UTF-8 (or another encoding you
specify). [Doc reference]

It's now easy to make large-scale changes to the parse tree without
screwing up the navigation members. The methods are extract,
replaceWith, and insert. [Doc reference.
See also Improving Memory
Usage with extract]

Passing True in as an attribute value gives you tags that have any
value for that attribute. You don't have to create a regular
expression. Passing None for an attribute value gives you tags that
don't have that attribute at all.

Tag objects now know whether or not they're self-closing. This avoids
the problem where Beautiful Soup thought that tags like <BR /> were
self-closing even in XML documents. You can customize the self-closing
tags for a parser object by passing them in as a list of
selfClosingTags: you don't have to subclass anymore.

There's a new built-in parser, MinimalSoup, which has most of
BeautifulSoup's HTML-specific rules, but no tag nesting rules. [Doc
reference]

You can use a SoupStrainer to tell Beautiful Soup to parse only part
of a document. This saves time and memory, often making Beautiful Soup
about as fast as a custom-built SGMLParser subclass. [Doc reference,
SoupStrainer reference]

You can (usually) use keyword arguments instead of passing a
dictionary of attributes to a search method. That is, you can replace
soup(attrs={"id" : "5"}) with soup(id="5"). You can still use attrs if
(for instance) you need to find an attribute whose name clashes with
the name of an argument to findAll. [Doc reference: **kwargs attrs]

The method names have changed to the better method names used in
Rubyful Soup. Instead of find methods and fetch methods, there are
only find methods. Instead of a scheme where you can't remember which
method finds one element and which one finds them all, we have find
and findAll. In general, if the method name mentions All or a plural
noun (e.g. findNextSiblings), then it finds many elements.
Otherwise, it only finds one element. [Doc reference]

Some of the argument names have been renamed for clarity. For instance
avoidParserProblems is now parserMassage.
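
The keyword-argument searches described above can be sketched as
follows. This is a minimal illustration written against Beautiful Soup
4 rather than 3.0.0, so it assumes bs4 is installed and uses the bs4
spellings find_all() and attrs; the markup is made up for the example.

```python
from bs4 import BeautifulSoup

markup = '<a id="5" href="/x">one</a><a id="6">two</a>'
soup = BeautifulSoup(markup, "html.parser")

# The same search, written with a dictionary and with a keyword argument.
by_dict = soup.find_all(attrs={"id": "5"})
by_keyword = soup.find_all(id="5")

# True matches any tag that has the attribute, whatever its value.
with_href = soup.find_all("a", href=True)

# find() returns one element; find_all() returns all of them.
first_link = soup.find("a")
all_links = soup.find_all("a")
```

The dictionary form is still the one to reach for when an attribute
name collides with a parameter of the search method itself.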

Beautiful Soup no longer implements a feed method. You need to pass a
string or a filehandle into the soup constructor, not with feed after
the soup has been created. There is still a feed method, but it's the
feed method implemented by SGMLParser and calling it will bypass
Beautiful Soup and cause problems.

The NavigableText class has been renamed to NavigableString. There is
no NavigableUnicodeString anymore, because every string inside a
Beautiful Soup parse tree is a Unicode string.

findText and fetchText are gone. Just pass a text argument into find
or findAll.

Null was more trouble than it was worth, so I got rid of it. Anything
that used to return Null now returns None.

Special XML constructs like comments and CDATA now have their own
NavigableString subclasses, instead of being treated as oddly-formed
data. If you parse a document that contains CDATA and write it back
out, the CDATA will still be there.

When you're parsing a document, you can get Beautiful Soup to convert
XML or HTML entities into the corresponding Unicode characters. [Doc
reference]

= 2.1.1 (20050918) =

Fixed a serious performance bug in BeautifulStoneSoup which was
causing parsing to be incredibly slow.

Corrected several entities that were previously being incorrectly
translated from Microsoft smart-quote-like characters.

Fixed a bug that was breaking text fetch.

Fixed a bug that crashed the parser when text chunks that look like
HTML tag names showed up within a SCRIPT tag.

THEAD, TBODY, and TFOOT tags are now nestable within TABLE
tags. Nested tables should parse more sensibly now.

BASE is now considered a self-closing tag.

= 2.1.0 "Game, or any other dish?" (20050504) =

Added a wide variety of new search methods which, given a starting
point inside the tree, follow a particular navigation member (like
nextSibling) over and over again, looking for Tag and NavigableText
objects that match certain criteria. The new methods are findNext,
fetchNext, findPrevious, fetchPrevious, findNextSibling,
fetchNextSiblings, findPreviousSibling, fetchPreviousSiblings,
findParent, and fetchParents. All of these use the same basic code
used by first and fetch, so you can pass your weird ways of matching
things into these methods.

The fetch method and its derivatives now accept a limit argument.

You can now pass keyword arguments when calling a Tag object as though
it were a method.

Fixed a bug that caused all hand-created tags to share a single set of
attributes.

= 2.0.3 (20050501) =

Fixed Python 2.2 support for iterators.

Fixed a bug that gave the wrong representation to tags within quote
tags like <script>.

Took some code from Mark Pilgrim that treats CDATA declarations as
data instead of ignoring them.

Beautiful Soup's setup.py will now do an install even if the unit
tests fail. It won't build a source distribution if the unit tests
fail, so I can't release a new version unless they pass.

= 2.0.2 (20050416) =

Added the unit tests in a separate module, and packaged it with
distutils.

Fixed a bug that sometimes caused renderContents() to return a Unicode
string even if there was no Unicode in the original string.

Added the done() method, which closes all of the parser's open
tags. It gets called automatically when you pass in some text to the
constructor of a parser class; otherwise you must call it yourself.

Reinstated some backwards compatibility with 1.x versions: referencing
the string member of a NavigableText object returns the NavigableText
object instead of throwing an error.

= 2.0.1 (20050412) =

Fixed a bug that caused bad results when you tried to reference a tag
name shorter than 3 characters as a member of a Tag, eg. tag.table.td.

Made sure all Tags have the 'hidden' attribute so that an attempt to
access tag.hidden doesn't spawn an attempt to find a tag named
'hidden'.

Fixed a bug in the comparison operator.

= 2.0.0 "Who cares for fish?" (20050410) =

Beautiful Soup version 1 was very useful but also pretty stupid. I
originally wrote it without noticing any of the problems inherent in
trying to build a parse tree out of ambiguous HTML tags. This version
solves all of those problems to my satisfaction. It also adds many new
clever things to make up for the removal of the stupid things.

== Parsing ==

The parser logic has been greatly improved, and the BeautifulSoup
class should much more reliably yield a parse tree that looks like
what the page author intended. For a particular class of odd edge
cases that now causes problems, there is a new class,
ICantBelieveItsBeautifulSoup.

By default, Beautiful Soup now performs some cleanup operations on
text before parsing it. This is to avoid common problems with bad
definitions and self-closing tags that crash SGMLParser. You can
provide your own set of cleanup operations, or turn it off
altogether. The cleanup operations include fixing self-closing tags
that don't close, and replacing Microsoft smart quotes and similar
characters with their HTML entity equivalents.

You can now get a pretty-print version of parsed HTML to get a visual
picture of how Beautiful Soup parses it, with the Tag.prettify()
method.

== Strings and Unicode ==

There are separate NavigableText subclasses for ASCII and Unicode
strings. These classes directly subclass the corresponding base data
types. This means you can treat NavigableText objects as strings
instead of having to call methods on them to get the strings.

str() on a Tag always returns a string, and unicode() always returns
Unicode. Previously it was inconsistent.

== Tree traversal ==

In a first() or fetch() call, the tag name or the desired value of an
attribute can now be any of the following:

 * A string (matches that specific tag or that specific attribute value)
 * A list of strings (matches any tag or attribute value in the list)
 * A compiled regular expression object (matches any tag or attribute
   value that matches the regular expression)
 * A callable object that takes the Tag object or attribute value as a
   string. It returns None/false/empty string if the given string
   doesn't match, and any other value if it does.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I took out
SQL-style wildcards. I'll put them back if someone complains, but
their removal simplifies the code a lot.

You can use fetch() and first() to search for text in the parse tree,
not just tags. There are new alias methods fetchText() and firstText()
designed for this purpose. As with searching for tags, you can pass in
a string, a regular expression object, or a method to match your text.

If you pass in something besides a map to the attrs argument of
fetch() or first(), Beautiful Soup will assume you want to match that
thing against the "class" attribute. When you're scraping
well-structured HTML, this makes your code a lot cleaner.

1.x and 2.x both let you call a Tag object as a shorthand for
fetch().
For instance, foo("bar") is a shorthand for
foo.fetch("bar"). In 2.x, you can also access a specially-named member
of a Tag object as a shorthand for first(). For instance, foo.barTag
is a shorthand for foo.first("bar"). By chaining these shortcuts you
can traverse a tree in very little code: for header in
soup.bodyTag.pTag.tableTag('th'):

If an element relationship (like parent or next) doesn't apply to a
tag, it'll now show up as Null instead of None. first() will also
return Null if you ask it for a nonexistent tag. Null is an object
that's just like None, except you can do whatever you want to it and
it'll give you Null instead of throwing an error.

This lets you do tree traversals like soup.htmlTag.headTag.titleTag
without having to worry if the intermediate stages are actually
there. Previously, if there was no 'head' tag in the document, headTag
in that instance would have been None, and accessing its 'titleTag'
member would have thrown an AttributeError. Now, you can get what you
want when it exists, and get Null when it doesn't, without having to
do a lot of conditionals checking to see if every stage is None.

There are two new relations between page elements: previousSibling and
nextSibling. They reference the previous and next element at the same
level of the parse tree. For instance, if you have HTML like this:

 <p><ul><li>Foo<br /><li>Bar</ul>

The first 'li' tag has a previousSibling of Null and its nextSibling
is the second 'li' tag. The second 'li' tag has a nextSibling of Null
and its previousSibling is the first 'li' tag. The previousSibling of
the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the
'br' tag.

I took out the ability to use fetch() to find tags that have a
specific list of contents. See, I can't even explain it well. It was
really difficult to use, I never used it, and I don't think anyone
else ever used it.
To the extent anyone did, they can probably use
fetchText() instead. If it turns out someone needs it I'll think of
another solution.

== Tree manipulation ==

You can add new attributes to a tag, and delete attributes from a
tag. In 1.x you could only change a tag's existing attributes.

== Porting Considerations ==

There are three changes in 2.0 that break old code:

In the post-1.2 release you could pass a function into fetch(). The
function took a string, the tag name. In 2.0, the function takes the
actual Tag object.

It's no longer possible to pass in SQL-style wildcards to fetch(). Use
a regular expression instead.

The different parsing algorithm means the parse tree may not be shaped
like you expect. This will only actually affect you if your code uses
one of the affected parts. I haven't run into this problem yet while
porting my code.

= Between 1.2 and 2.0 =

This is the release to get if you want Python 1.5 compatibility.

The desired value of an attribute can now be any of the following:

 * A string
 * A string with SQL-style wildcards
 * A compiled RE object
 * A callable that returns None/false/empty string if the given value
   doesn't match, and any other value otherwise.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I no longer
recommend you use SQL-style wildcards. They may go away in a future
release to clean up the code.

Made Beautiful Soup handle processing instructions as text instead of
ignoring them.

Applied patch from Richie Hindle (richie at entrian dot com) that
makes tag.string a shorthand for tag.contents[0].string when the tag
has only one string-owning child.

Added still more nestable tags. The nestable tags thing won't work in
a lot of cases and needs to be rethought.
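
The four kinds of attribute criteria listed for this release can be
sketched in plain Python. This is an illustration only, not the
library's implementation: value_matches and wildcard_to_regex are
hypothetical names, and the sketch assumes that '%' in a SQL-style
wildcard stands for any run of characters, as it does in SQL's LIKE.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a SQL-style wildcard pattern into a compiled regex.

    Assumption: '%' means "any run of characters, possibly empty".
    Everything else in the pattern is matched literally.
    """
    literal_pieces = [re.escape(piece) for piece in pattern.split("%")]
    return re.compile(".*".join(literal_pieces), re.DOTALL)


def value_matches(criterion, value):
    """Test an attribute value against one of the four criterion kinds."""
    if callable(criterion):
        # Callable: any false value (None, False, "") means "no match".
        return bool(criterion(value))
    if hasattr(criterion, "search"):
        # Compiled regular expression object.
        return criterion.search(value) is not None
    if "%" in criterion:
        # String with SQL-style wildcards: the whole value must fit.
        return wildcard_to_regex(criterion).fullmatch(value) is not None
    # Plain string: exact comparison.
    return criterion == value


print(value_matches("bar", "bar"))
print(value_matches("%foo", "xfoo"))
print(value_matches(re.compile("foo$"), "xfoo"))
print(value_matches(lambda v: len(v) == 3, "bar"))
```

The third call hints at the migration path away from wildcards: under
the assumption above, a pattern like "%foo" can be written as the
regular expression "foo$".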

Fixed an edge case where searching for "%foo" would match any string
shorter than "foo".

= 1.2 "Who for such dainties would not stoop?" (20040708) =

Applied patch from Ben Last (ben at benlast dot com) that made
Tag.renderContents() correctly handle Unicode.

Made BeautifulStoneSoup even dumber by making it not implicitly close
a tag when another tag of the same type is encountered; only when an
actual closing tag is encountered. This change courtesy of Fuzzy (mike
at pcblokes dot com). BeautifulSoup still works as before.

= 1.1 "Swimming in a hot tureen" =

Added more 'nestable' tags. Changed popping semantics so that when a
nestable tag is encountered, tags are popped up to the previously
encountered nestable tag (of whatever kind). I will revert this if
enough people complain, but it should make more people's lives easier
than harder. This enhancement was suggested by Anthony Baxter (anthony
at interlink dot com dot au).

= 1.0 "So rich and green" (20040420) =

Initial release.