KKc@sdZddklZdZdZdZdZddkZddkZddk Z ddk Z ddk l Z l Z ydd k lZWnej o hZnXyeWn#ej odd klZnXe id ie_d Zd ZdfdYZdeefdYZdefdYZdefdYZdefdYZdefdYZdefdYZ dfdYZ!de"fdYZ#d Z$d!Z%d"Z&d#e fd$YZ'd%e fd&YZ(d'e(fd(YZ)d)e*fd*YZ+d+e)fd,YZ,d-e)fd.YZ-d/e(fd0YZ.d1e(fd2YZ/d3e)fd4YZ0d5e,fd6YZ1d7e-fd8YZ2d9e.fd:YZ3yddk4Z4Wnej o e5Z4nXyddk6Z7Wnej onXyddk8Z8Wnej onXd;fd<YZ9e:d=jo*ddk;Z;e)e;i<Z=e=i>GHndS(>s Beautiful Soup Elixir and Tonic "The Screen-Scraper's Friend" http://www.crummy.com/software/BeautifulSoup/ Beautiful Soup parses a (possibly invalid) XML or HTML document into a tree representation. It provides methods and Pythonic idioms that make it easy to navigate, search, and modify the tree. A well-formed XML/HTML document yields a well-formed data structure. An ill-formed XML/HTML document yields a correspondingly ill-formed data structure. If your document is only locally well-formed, you can use this library to find and process the well-formed part of it. Beautiful Soup works with Python 2.2 and up. It has no external dependencies, but you'll have more success at converting data to UTF-8 if you also install these three packages: * chardet, for auto-detecting character encodings http://chardet.feedparser.org/ * cjkcodecs and iconv_codec, which add more encodings to the ones supported by stock Python. http://cjkpython.i18n.org/ Beautiful Soup defines classes for two main parsing strategies: * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific language that kind of looks like XML. * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid or invalid. This class has web browser-like heuristics for obtaining a sensible parse tree in the face of common HTML errors. Beautiful Soup also defines a class (UnicodeDammit) for autodetecting the encoding of an HTML or XML document, and converting it to Unicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser. 
For more than you ever wanted to know about Beautiful Soup, see the documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html Here, have some legalese: Copyright (c) 2004-2009, Leonard Richardson All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of the the Beautiful Soup Consortium and All Night Kosher Bakery nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT. 
i(t generatorss*Leonard Richardson (leonardr@segfault.org)s3.1.0.1s*Copyright (c) 2004-2009 Leonard Richardsons New-style BSDN(t HTMLParsertHTMLParseError(tname2codepoint(tSets-zA-Z][-_.:a-zA-Z0-9]*\s*sutf-8cCs&|djo|Sn|i|SdS(s8Returns either the given Unicode string or its encoding.N(tNonetencode(tunicodetencoding((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pytsobks t PageElementcBsveZdZdddZdZdZdZdZdZ dhddZ dhdddZ dhdd Z dhddd Z e Zdhdd Zdhddd ZeZdhdd ZdhdddZeZdhdZdhddZeZdZdZdZdZdZdZdZddZddZ RS(seContains the navigational information for some part of the page (either a tag or a piece of text)cCsk||_||_d|_d|_d|_|io0|iio#|iid|_||i_ndS(sNSets up the initial relations between this element and other elements.iN(tparenttpreviousRtnexttpreviousSiblingt nextSiblingtcontents(tselfR R ((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pytsetupvs     cCs|i}|iii|}t|doN|i|ijo;|iii|}|o||jo|d}q|n|i|i||dS(NR i(R Rtindexthasattrtextracttinsert(Rt replaceWitht oldParenttmyIndexR((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRs # cCs|io1y|iii|Wq;tj oq;Xn|i}|i}|io||i_n|o|i|_nd|_d|_d|_|io|i |i_ n|i o|i|i _nd|_|_ |S(s0Destructively rips this element out of the tree.N( R Rtremovet ValueErrort_lastRecursiveChildR R RRR(Rt lastChildt nextElement((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRs(          cCs9|}x,t|do|io|id}q W|S(s8Finds the last element beneath this object to be parsed.Ri(RR(RR((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRs c Cs<t|tpt|to!t|t ot|}nt|t|i}t|doc|idjoS|i|jo5|i |}|o||jo|d}qn|i n||_d}|djod|_ ||_ n6|i|d}||_ ||i _|i|_ |i o||i _n|i}|t|ijocd|_|}d}x*|p"|i}|i}|pPqqW|o ||_q d|_n:|i|}||_|io||i_ n||_|io||i_ n|ii||dS(NR ii(t isinstancet basestringRtNavigableStringtmintlenRRR RtfindRRR RRR R( RtpositiontnewChildRt previousChildtnewChildsLastElementR tparentsNextSiblingt 
nextChild((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRsX                    cCs|it|i|dS(s2Appends the given tag to the contents of this tag.N(RR#R(Rttag((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pytappendscKs|i|i||||S(sjReturns the first item that matches the given criteria and appears after this Tag in the document.(t_findOnet findAllNext(Rtnametattrsttexttkwargs((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pytfindNextscKs|i|||||i|S(sbReturns all items that match the given criteria and appear after this Tag in the document.(t_findAllt nextGenerator(RR/R0R1tlimitR2((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR.scKs|i|i||||S(s{Returns the closest sibling to this Tag that matches the given criteria and appears after this Tag in the document.(R-tfindNextSiblings(RR/R0R1R2((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pytfindNextSiblingscKs|i|||||i|S(sqReturns the siblings of this Tag that match the given criteria and appear after this Tag in the document.(R4tnextSiblingGenerator(RR/R0R1R6R2((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR7scKs|i|i||||S(skReturns the first item that matches the given criteria and appears before this Tag in the document.(R-tfindAllPrevious(RR/R0R1R2((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt findPreviousscKs|i|||||i|S(scReturns all items that match the given criteria and appear before this Tag in the document.(R4tpreviousGenerator(RR/R0R1R6R2((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR:scKs|i|i||||S(s|Returns the closest sibling to this Tag that matches the given criteria and appears before this Tag in the 
document.(R-tfindPreviousSiblings(RR/R0R1R2((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pytfindPreviousSiblingscKs|i|||||i|S(srReturns the siblings of this Tag that match the given criteria and appear before this Tag in the document.(R4tpreviousSiblingGenerator(RR/R0R1R6R2((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR=!scKs4d}|i||d}|o|d}n|S(sOReturns the closest parent of this Tag that matches the given criteria.iiN(Rt findParents(RR/R0R2trtl((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt findParent)s cKs|i||d||i|S(sFReturns the parents of this Tag that match the given criteria.N(R4RtparentGenerator(RR/R0R6R2((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR@4scKs7d}||||d|}|o|d}n|S(Nii(R(RtmethodR/R0R1R2RARB((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR->s c Kst|to |}nt||||}t|}|} xto|y| i} Wntj oPnX| oJ|i| } | o0|i| |ot||joPqqqGqGW|S(s8Iterates over a generator looking for things that match.( Rt SoupStrainert ResultSettTrueR t StopIterationtsearchR,R#( RR/R0R1R6t generatorR2tstrainertresultstgtitfound((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR4Es$    ccs'|}x|o|i}|Vq WdS(N(R (RRO((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR5^s  ccs'|}x|o|i}|Vq WdS(N(R(RRO((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR9ds  ccs'|}x|o|i}|Vq WdS(N(R (RRO((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR<js  ccs'|}x|o|i}|Vq WdS(N(R(RRO((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR?ps  ccs'|}x|o|i}|Vq WdS(N(R (RRO((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRDvs  
cCs|pd}|id|S(Nsutf-8s%SOUP-ENCODING%(treplace(RtstrR((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pytsubstituteEncoding}s cCst|to|o|i|}qnjt|to*|o|i|}qt|}n0|o|it||}n t|}|S(sHEncodes an object to a string in some encoding, or to Unicode. .(RRRRRt toEncoding(RtsR((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRTs N(!t__name__t __module__t__doc__RRRRRRR,R3R.R8R7tfetchNextSiblingsR;R:t fetchPreviousR>R=tfetchPreviousSiblingsRCR@t fetchParentsR-R4R5R9R<R?RDRSRT(((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR rs>    <            R!cBs8eZdZdZdZedZdZRS(cCs7t|toti||Snti||tS(s-Create a new NavigableString. When unpickling a NavigableString, this method is called with the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be passed in to the superclass's __new__ or the superclass won't know how to handle non-ASCII characters. (RRt__new__tDEFAULT_OUTPUT_ENCODING(tclstvalue((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR]scCs t|fS(N(R(R((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt__getnewargs__scCs2|djo|Sntd|ii|fdS(stext.string gives you text. 
This is for backwards compatibility for Navigable*String, but for CData* it lets you get the string without the CData wrapper.tstrings!'%s' object has no attribute '%s'N(tAttributeErrort __class__RV(Rtattr((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt __getattr__s cCs|ii|S(N(tdecodeR(RR((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRscCs|S(N((RteventualEncoding((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pytdecodeGivenEventualEncodings(RVRWR]RaRfR^RRi(((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR!s   tCDatacBseZdZRS(cCs d|dS(Nu ((RRh((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRis(RVRWRi(((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRjstProcessingInstructioncBseZdZRS(cCs5|}d|jo|i||}nd|dS(Nu%SOUP-ENCODING%u(RS(RRhtoutput((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRis (RVRWRi(((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRkstCommentcBseZdZRS(cCs d|dS(Nu((RRh((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRis(RVRWRi(((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRmst DeclarationcBseZdZRS(cCs d|dS(Nu((RRh((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRis(RVRWRi(((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRnstTagcBseZdZdZhdd<dd<dd<dd <d d tgtcCs |id}|io|tjott|Sn||ijo%|io|i|Sqd|Snt|djoh|ddjoWt|djo,|ddjott|ddSqtt|dSn|io d|Sn d|Sd S( sUsed in a call to re.sub to replace HTML, XML, and numeric entities with the appropriate Unicode characters. 
If HTML entities are being converted, any unrecognized entities are escaped.iu&%s;it#txiiu&%s;N( tgrouptconvertHTMLEntitiesRtunichrtXML_ENTITIES_TO_SPECIAL_CHARStconvertXMLEntitiesR#tinttescapeUnrecognizedEntities(RtmatchR((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt_convertEntitiess  $$  cs|i_|i|_|_|djo g}n|_g_i||t _ t _ |i _ |i _ |i_fd}t|i_dS(sBasic constructor.cs=|\}}|djo|Sn|tidi|fS(s?Converts HTML, XML and numeric entities in the attribute value.s&(#\d+|#x[0-9a-fA-F]+|\w+);N(RtretsubR(tkvalRrtval(R(sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pytconverts    N(Rdt parserClasstisSelfClosingTagt isSelfClosingR/RR0RRtFalsethiddentcontainsSubstitutionsRRRtmap(RtparserR/R0R R R((RsP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt__init__s           cCs|ii||S(sReturns the value of the 'key' attribute for the tag, or the value given for 'default' if it doesn't have that attribute.(t _getAttrMaptget(Rtkeytdefault((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRscCs|ii|S(N(Rthas_key(RR((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRscCs|i|S(sqtag[key] returns the value of the 'key' attribute for the tag, and throws an exception if it's not there.(R(RR((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt __getitem__scCs t|iS(s0Iterating over a tag iterates over its contents.(titerR(R((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt__iter__scCs t|iS(s:The length of a tag is the length of its list of contents.(R#R(R((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt__len__#scCs ||ijS(N(R(RR((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt __contains__'scCstS(s-A tag is non-None even if it has no 
contents.(RH(R((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt __nonzero__*scCs|i||i|]|s&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)t)cCs d|i|idddS(smUsed with a regular expression to substitute the appropriate XML entity for an XML special character.Ryit;(tXML_SPECIAL_CHARS_TO_ENTITIESR(RR((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt _sub_entityoscCs |iS(N(Rg(R((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt __unicode__tscCs |iS(N(R(R((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt__str__wsicCs|i|||i|S(N(RgR(RRt prettyPrintt indentLevel((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRzscCsg}|iox|iD]\}}d}t|o|io0|dj o#d|jo|i||}nd|jo-d}d|jo|idd}qn|ii|i|}n|djo |}n|||f}|i |qWnd} d} |i o d} nd |i } d\} } |o"|} d | d } | d } n|i || |}|i o |}ng}d}|od d i|}n|o|i | n|i d |i || f|o|i dn|i ||o)|o"|ddjo|i dn|o| o|i | n|i | |o"| o|io|i dndi|}|S(sxReturns a string or Unicode representation of this tag and its contents. 
To get Unicode, pass None for encoding.s%s="%s"s%SOUP-ENCODING%Rws%s='%s'Rus&squot;ts /sit is<%s%s%s>s iN(ii(R0tisStringRRRSRQtBARE_AMPERSAND_OR_BRACKETRRR,RR/tdecodeContentsRtjoinR(RRRRhR0RRtfmttdecodedtclosetcloseTagt indentTagtindentContentstspaceRRUtattributeString((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRg~sh                    cCskg}|iD] }||q~}x6|D].}t|to|iq+|iq+W|idS(s/Recursively destroys the contents of this tree.N(RRRot decomposeR(Rt_[1]ROR((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRs$cCs|i|tS(N(RRH(RR((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pytprettifyscCs|i||i|S(N(RR(RRRR((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pytencodeContentsscCsg}x|D]}d}t|to|i|}n1t|to |i|i|||n|o|o|i}n|oI|o|id|dn|i||o|idqq q Wdi|S(s{Renders the contents of this tag as a string in the given encoding. If encoding is None, returns a Unicode string..Ris RN( RRR!RiRoR,RgtstripR(RRRRhRUtcR1((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRs"  cKs=d}|i||||d|}|o|d}n|S(sLReturn only the first child of this Tag matching the given criteria.iiN(RR(RR/R0t recursiveR1R2RARB((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR$s cKs9|i}|p |i}n|i||||||S(sExtracts a list of Tag objects that match the given criteria. You can specify the name of the Tag and any attributes you want the Tag to have. The value of a key-value pair in the 'attrs' map can be a string, a list of strings, a regular expression object, or a callable that takes a string and returns whether or not the string matches for some custom definition of 'matches'. 
The same is true of the tag name.(trecursiveChildGeneratortchildGeneratorR4(RR/R0RR1R6R2RK((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRs  cCs|id|d|d|S(NR1RR6(R(RR1RR6((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt fetchTextscCs|id|d|S(NR1R(R$(RR1R((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt firstTextscCs;|djo|i|||Sn|i|||SdS(N(RRR(RRRR((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pytrenderContents"s cCsKt|dp4h|_x(|iD]\}}||i||iiott|}n d|}|i|dS(s$Handle character references as data.s&#%s;N(RtconvertEntitiesRRR(Rtreftdata((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pythandle_charrefs  cCsd}|iio.ytt|}WqAtj oqAXn| o&|iio|iii|}n| o2|iio%|iii| od|}n|pd|}n|i |dS(sHandle entity references as data, possibly converting known HTML and/or XML entity references to the corresponding Unicode characters.s&%ss&%s;N( RRRRRtKeyErrorRRRR(RR R ((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pythandle_entityrefs  cCs|i|tdS(s4Handle DOCTYPEs and the like as Declaration objects.N(RRn(RR ((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt handle_declGscCsd}|i||d!djog|iid|}|djot|i}n|i|d|!}|d}|i|tnWyti||}Wn=tj o1|i|}|i ||t|}nX|S(s`Treat a bogus SGML declaration as raw data. Treat a CDATA declaration as a CData object.i s iiN( RtrawdataR$R#RRjRtparse_declarationRR(RROtjRrR ttoHandle((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRKs    ( RVRWRRRRRRRR R RR(((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRs         + tBeautifulStoneSoupc BsLeZdZhZhZhZhZgZei ddfei ddfgZ dZ dZ dZ dZeZhdd <dd <dd <dd <dd " actually means "". [Another possible explanation is "", but since this class defines no SELF_CLOSING_TAGS, it will never use that explanation.] 
This class is useful for parsing XML or made-up markup languages, or when BeautifulSoup makes an assumption counter to what you were expecting.s (<[^<>]*)/>cCs|iddS(is />(R(R((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pytyss]*)>cCsd|iddS(s (No space between name of closing tag and tag close) (Extraneous whitespace in declaration) You can pass in a custom list of (RE object, replace method) tuples to get Beautiful Soup to scrub your input the way you want.treadtisHTMLN(tparseOnlyTheset fromEncodingt smartQuotesToRRt HTML_ENTITIESRRRHRRtXHTML_ENTITIESt XML_ENTITIESRtinstanceSelfClosingTagstbuildertresetRRRt markupMassaget_feedt StopParsing( RRRRR"RRtselfClosingTagsRR ((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRsD                     cCs@|i}t|to!t|dp d|_qnIt||i|gd|id|}|i}|i|_|i |_ |od|i oVt |i p|i |_ nx)|i D]\}}|i ||}qW|` qn|ii|ii||ix%|ii|ijo|iqWdS(NtoriginalEncodingRR(RRRRRR&t UnicodeDammitRRtdeclaredHTMLEncodingR"RtMARKUP_MASSAGERR R!tfeedRt currentTagR/t ROOT_TAG_NAMEtpopTag(RtinDocumentEncodingRRtdammittfixtm((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR#s0        cCs#|ii|p|ii|S(seReturns true iff the given string is the name of a self-closing tag according to this parser.(tSELF_CLOSING_TAGSRR(RR/((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRscCsati|||id|_|iig|_d|_g|_ g|_ |i |dS(Ni( RoRR,RR R!t currentDataRR+ttagStackt quoteStacktpushTag(R((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR!s      cCs|ii}t|iidjo4t|iidto|iid|i_n|io|id|_n|iS(Niii(R4tpopR#R+RRR!Rb(RR+((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR-s cCsE|io|iii|n|ii||id|_dS(Ni(R+RR,R4(RR+((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR6s cCsD|io6di|i}|i|idjo\tg}|iD]}||iqF~i|i o!d|jo d}qd}ng|_|i o@t |idjo*|i i p|i i | odSn||}|i 
|i|i|io||i_n||_|iii|ndS(NuRs Ri(R3Rt translatetSTRIP_ASCII_SPACEStsetR4R/t intersectiontPRESERVE_WHITESPACE_TAGSRR#R1RJRR+R R RR,(RtcontainerClassR3RR+to((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRs& -        cCs||ijodSnd}d}xVtt|idddD]5}||i|ijot|i|}PqDqDW|p|d}nx#td|D]}|i}qW|S(sPops the tag stack up to and including the most recent instance of the given tag. If inclusivePop is false, pops the tag stack up to but *not* including the most recent instqance of the given tag.Niii(R,RRR#R4R/R-(RR/t inclusivePoptnumPopst mostRecentTagRO((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt _popToTag1s  c Cs!|ii|}|dj}|ii|}d}t}xtt|idddD]}|i|}| p|i |jo| o |}Pn|djo|i |jp*|djo1|o*|ii|i o|i }t }Pn|i }q\W|o|i ||ndS(sWe need to pop up to the previous tag of this type, unless one of this tag's nesting reset triggers comes between this tag and the previous tag of this type, OR unless this tag is a generic nesting trigger and another generic nesting trigger comes between this tag and the previous tag of this type. Examples:

 <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
 <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'.
 <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'.

 <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
 <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
    ** should pop to 'tr', not the first 'td' iiiN( t NESTABLE_TAGSRRtRESET_NESTING_TAGSRRHRR#R4R/RR RB( RR/tnestingResetTriggerst isNestabletisResetNestingtpopTot inclusiveROtp((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyt _smartPopGs*       icCsh|io:ditd|}|id||fdSn|i|i| o| o|i|n|ioBt|i djo,|ii p|ii || odSnt ||||i |i}|io||i_n||_|i||p|i|o|in||ijo|ii|d|_n|S(NRcSs|\}}d||fS(s %s="%s"((t.0Rty((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRzss<%s%s>i(R5RRRRRRKRR#R4R1RRoR+R R R6R-t QUOTE_TAGSR,tliteral(RR/R0t selfClosingR+((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRus*   $    cCs|io-|id|jo|id|dSn|i|i||io=|id|jo)|iit|idj|_ndS(Nisi(R5RRRBR7R#RO(RR/((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRs   cCs|ii|dS(N(R3R,(RR ((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRscCs|id|dS(NR(R(RR0((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRsN(#RVRWRXR2RCRDRNR<RRR)R,RRRt ALL_ENTITIESRR9RHRRRR#RR!R-R6R!RRBRKRRRR(((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyR`s@   3   E!      .  t BeautifulSoupc BseZdZdZed.dddddddd d g Zed d gZhd.d <d.d tag should implicitly close the previous

<p> tag.

 <p>Para1<p>Para2
  should be transformed into:
 <p>Para1</p><p>Para2

Some tags can be nested arbitrarily. For instance, the occurrence of a
<blockquote> tag should _not_ implicitly close the previous
<blockquote> tag.

 Alice said: <blockquote>Bob said: <blockquote>Blah
  should NOT be transformed into:
 Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

Some tags can be nested, but the nesting is reset by the
interposition of other tags. For instance, a <tr> tag should
implicitly close the previous <tr> tag within the same <table>,
but not close a <tr> tag in another table.

 <table><tr>Blah<tr>Blah
  should be transformed into:
 <table><tr>Blah</tr><tr>Blah
  but,
 <tr>Blah<table><tr>Blah
  should NOT be transformed into
 <tr>Blah<table></tr><tr>Blah

Differing assumptions about tag nesting rules are a major source of problems with the BeautifulSoup class. If BeautifulSoup is not treating as nestable a tag your page author treats as nestable, try ICantBelieveItsBeautifulSoup, MinimalSoup, or BeautifulStoneSoup before writing your own subclass.cOsB|idp|i|dFooBar This is perfectly valid (if bizarre) HTML. However, the BeautifulSoup class will implicitly close the first b tag when it encounters the second 'b'. It will think the author wrote "<b>Foo<b>Bar", and didn't close the first 'b' tag, because there's no real-world reason to bold something that's already bold. When it encounters '</b></b>' it will close two more 'b' tags, for a grand total of three tags closed instead of two. This can throw off the rest of your document structure. The same is true of a number of other tags, listed below. It's much more common for someone to forget to close a 'b' tag than to actually use nested 'b' tags, and the BeautifulSoup class handles the common case. This class handles the not-so-common case: where you can't believe someone wrote what they did, but it's valid HTML and BeautifulSoup screwed up by assuming it wouldn't be.temtbigROtsmallttttabbrtacronymtstrongtcitetcodetdfntkbdtsamptvartbRy(RVRWRXt*I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGSt)I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGSRRRRC(((sP/usr/local/google/WebKitToT/WebKit/WebKitTools/Scripts/webkitpy/BeautifulSoup.pyRDs   t MinimalSoupcBs eZdZedZhZRS(sThe MinimalSoup class is for parsing HTML that contains pathologically bad markup. It makes no assumptions about tag nesting, but it does know which tags are self-closing, that