1                        TagSoup - Just Keep On Truckin'
2
3  Introduction
4
5   This is the home page of TagSoup, a SAX-compliant parser written in
6   Java that, instead of parsing well-formed or valid XML, parses HTML as
7   it is found in the wild: [1]poor, nasty and brutish, though quite often
8   far from short. TagSoup is designed for people who have to process this
9   stuff using some semblance of a rational application design. By
10   providing a SAX interface, it allows standard XML tools to be applied
11   to even the worst HTML. TagSoup also includes a command-line processor
12   that reads HTML files and can generate either clean HTML or well-formed
13   XML that is a close approximation to XHTML.
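
   As a minimal sketch of what "applying standard XML tools" can look
   like, the following feeds a SAX XMLReader into the JDK's identity
   Transformer and returns the serialized result. The class name
   SaxPipeline and its two helper methods are invented for this
   illustration. The parser class name org.ccil.cowan.tagsoup.Parser is
   not stated in this README; it is loaded by name so that the file
   compiles without the jar, which must be on the classpath at run time.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxPipeline {
    /** Loads the TagSoup parser reflectively (assumed class name, see lead-in). */
    public static XMLReader tagSoupReader() {
        try {
            return (XMLReader) Class.forName("org.ccil.cowan.tagsoup.Parser")
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("TagSoup jar not on the classpath", e);
        }
    }

    /** Pipes the events of any SAX XMLReader through an identity transform. */
    public static String toXml(XMLReader reader, String markup) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(
                    new SAXSource(reader, new InputSource(new StringReader(markup))),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Deliberately sloppy HTML: no html, body or ul, and unclosed li tags.
        System.out.println(toXml(tagSoupReader(), "<li>one<li>two"));
    }
}
```

   Because toXml accepts any XMLReader, the same pipeline works unchanged
   with an ordinary XML parser; only the reader passed in decides whether
   the input has to be well-formed.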
14
15   This is also the README file packaged with TagSoup.
16
17   TagSoup is free and Open Source software. As of version 1.2, it is
18   licensed under the [2]Apache License, Version 2.0, which allows
19   proprietary re-use as well as use with GPL 3.0 or GPL 2.0-or-later
20   projects. (If anyone needs a GPL 2.0 license for a GPL 2.0-only
21   project, feel free to ask.)
22
23  Warning: TagSoup will not build on stock Java 5.x or 6.x!
24
25   Due to a bug in the versions of Xalan shipped with Java 5.x and 6.x,
26   TagSoup will not build out of the box. You need to retrieve [3]Saxon
27   6.5.5, which does not have the bug. Unpack the zipfile in an empty
28   directory and copy the saxon.jar and saxon-xml-apis.jar files to
29   $ANT_HOME/lib. The Ant build process for TagSoup will then notice that
30   Saxon is available and use it instead.
31
32  TagSoup 1.2 released
33
34   There are a great many changes, most of them fixes for long-standing
35   bugs, in this release. Only the most important are listed here; for the
36   rest, see the CHANGES file in the source distribution. Very special
37   thanks to Jojo Dijamco, whose intensive efforts at debugging made this
38   release a usable upgrade rather than a useless mass of undetected bugs.
39     * As noted above, I have changed the license to Apache 2.0.
40     * The default content model for bogons (unknown elements) is now ANY
41       rather than EMPTY. This is a breaking change, which I have made
42       only because there was so much demand for it. It can be undone on
43       the command line with the --emptybogons switch, or programmatically
44       with parser.setFeature(Parser.emptyBogonsFeature, true).
45     * The processing of entity references in attribute values has finally
46       been fixed to do what browsers do. That is, a reference is only
47       recognized if it is properly terminated by a semicolon; otherwise
48       it is treated as plain text. This means that URIs like
49       foo?cdown=32&cup=42 are no longer seen as containing an instance of
50       the ∪ character (whose name happens to be cup).
51     * Several new switches have been added:
52          + --doctype-system and --doctype-public force a DOCTYPE
53            declaration to be output and allow setting the system and
54            public identifiers.
55          + --standalone and --version allow control of the XML
56            declaration that is output. (Note that TagSoup's XML output is
57            always version 1.0, even if you use --version=1.1.)
58          + --norootbogons causes unknown elements not to be allowed as
59            the document root element. Instead, they are made children of
60            the default root element (the html element for HTML).
61     * The TagSoup core now supports character entities with values above
62       U+FFFF. As a consequence, the HTML schema now supports all 2,210
63       standard character entities from the [4]2007-12-14 draft of XML
64       Entity Definitions for Characters, except the 94 which require more
65       than one Unicode character to represent.
66     * The SAX events startPrefixMapping and endPrefixMapping are now
67       being reported for all cases of foreign elements and attributes.
68     * All bugs around newline processing on Windows should now be gone.
69     * A number of content models have been loosened to allow elements to
70       appear in new and non-standard (but commonly found) places. In
71       particular, tables are now allowed inside paragraphs, against the
72       letter of the W3C specification.
73     * Since the span element is intended for fine control of appearance
74       using CSS, it should never have been a restartable element. This
75       very long-standing bug has now been fixed.
76     * The following non-standard elements are now at least partly
77       supported: bgsound, blink, canvas, comment, listing, marquee, nobr,
78       rbc, rb, rp, rtc, rt, ruby, wbr, xmp.
79     * In HTML output mode, boolean attributes like checked are now output
80       as such, rather than in XML style as checked="checked".
81     * Runs of < characters such as << and <<< are now handled correctly
82       in text rather than being transformed into extremely bogus
83       start-tags.
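
   The attribute-value rule in the list above (a reference counts only
   when it ends with a semicolon) can be illustrated with a small
   stand-alone sketch. This is not TagSoup's own code: the class
   AttributeEntities, its expand method and the five-entry entity table
   are made up for the example, and numeric references are left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeEntities {
    // Tiny illustrative table; the real HTML schema knows far more names.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("quot", "\"");
        ENTITIES.put("cup", "\u222A");
    }

    // A named reference must be terminated by a semicolon to match at all.
    private static final Pattern REFERENCE =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    public static String expand(String attributeValue) {
        Matcher m = REFERENCE.matcher(attributeValue);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            if (replacement == null) {
                replacement = m.group();   // unknown name: keep the text as is
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("foo?cdown=32&cup=42"));  // unchanged
        System.out.println(expand("A &cup; B"));            // cup is expanded
    }
}
```

   With this rule the URI foo?cdown=32&cup=42 passes through untouched,
   because &cup is followed by = rather than by a semicolon.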
84
85   [5]Download the TagSoup 1.2 jar file here. It's about 87K long.
86   [6]Download the full TagSoup 1.2 source here. If you don't have zip,
87   you can use jar to unpack it.
88   [7]Download the current CHANGES file here.
89
90  TagSoup 1.1 released
91
92   TagSoup 1.1 adds Tatu Saloranta's JAXP support for TagSoup. To use
93   TagSoup within the JAXP framework (which is not something I necessarily
94   recommend, but it is part of the Java XML platform), you can create a
95   SAXParser by calling
96   org.ccil.cowan.tagsoup.jaxp.SAXParserImpl.newInstance(). You can also
97   set the system property javax.xml.parsers.SAXParserFactory to
98   org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl, but be aware that doing
99   this will cause all JAXP-based XML parsing to go through TagSoup, which
100   is a Bad Thing if your application also reads XML documents.
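
   One way to avoid the global side effect described above is to name the
   factory class per call rather than through the system property. The
   sketch below uses the standard JAXP method
   SAXParserFactory.newInstance(String, ClassLoader); the class
   TagSoupJaxp and its method names are made up for illustration, and the
   TagSoup jar is assumed to be on the classpath when newTagSoupParser is
   called.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class TagSoupJaxp {
    // Factory class named in the text above.
    public static final String FACTORY_CLASS =
            "org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl";

    /** HTML parsing: asks JAXP for the TagSoup factory explicitly. */
    public static SAXParser newTagSoupParser() throws Exception {
        // A null class loader means "use the current context class loader".
        SAXParserFactory factory = SAXParserFactory.newInstance(FACTORY_CLASS, null);
        return factory.newSAXParser();
    }

    /** XML parsing: the platform default factory is left untouched. */
    public static SAXParser newXmlParser() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser();
    }

    public static void main(String[] args) throws Exception {
        SAXParser html = newTagSoupParser();   // fails without the TagSoup jar
        SAXParser xml = newXmlParser();
        System.out.println(html.getClass().getName());
        System.out.println(xml.getClass().getName());
    }
}
```

   Since the system property is never set, code elsewhere in the
   application that reads real XML through JAXP keeps getting the
   platform's own parser.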
101
102  What TagSoup does
103
104   TagSoup is designed as a parser, not a whole application; it isn't
105   intended to permanently clean up bad HTML, as [8]HTML Tidy does, only
106   to parse it on the fly. Therefore, it does not convert presentation
107   HTML to CSS or anything similar. It does guarantee well-structured
108   results: tags will wind up properly nested, default attributes will
109   appear appropriately, and so on.
110
111   The semantics of TagSoup are as far as practical those of actual HTML
112   browsers. In particular, never, never will it throw any sort of syntax
113   error: the TagSoup motto is [9]"Just Keep On Truckin'". But there's
114   much, much more. For example, if the first tag is LI, it will supply
115   the application with enclosing HTML, BODY, and UL tags. Why UL? Because
116   that's what browsers assume in this situation. For the same reason,
117   overlapping tags are correctly restarted whenever possible: text like:
118       This is <B>bold, <I>bold italic, </b>italic, </i>normal text
119
120   gets correctly rewritten as:
121       This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
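
   The restarting of overlapping tags shown above can be sketched with a
   stack of open elements plus a list of elements waiting to be reopened.
   This is a simplified illustration, not TagSoup's actual
   implementation: the class RestartSketch, the set of restartable
   elements and the tokenizer (bare tags without attributes, plus text)
   are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RestartSketch {
    // Hypothetical set of restartable elements, chosen for this example only.
    private static final Set<String> RESTARTABLE =
            new HashSet<>(Arrays.asList("b", "i", "u", "em", "strong"));

    // A bare start-tag or end-tag (no attributes), or a run of text.
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)");

    public static String renest(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();    // open elements, innermost first
        List<String> pending = new ArrayList<>();   // to reopen, outermost first
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(3) != null) {               // text
                reopen(pending, open, out);
                out.append(m.group(3));
                continue;
            }
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {             // start-tag
                reopen(pending, open, out);
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (pending.remove(name)) {
                // End-tag for an element still waiting to be reopened: forget it.
            } else if (open.contains(name)) {       // end-tag for an open element
                String top;
                do {
                    top = open.pop();
                    out.append("</").append(top).append('>');
                    if (!top.equals(name) && RESTARTABLE.contains(top)) {
                        pending.add(0, top);        // closed early; reopen later
                    }
                } while (!top.equals(name));
            }                                       // stray end-tags are dropped
        }
        while (!open.isEmpty()) {                   // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    private static void reopen(List<String> pending, Deque<String> open,
                               StringBuilder out) {
        for (String name : pending) {
            open.push(name);
            out.append('<').append(name).append('>');
        }
        pending.clear();
    }

    public static void main(String[] args) {
        System.out.println(renest(
                "This is <B>bold, <I>bold italic, </b>italic, </i>normal text"));
    }
}
```

   Reopening is deferred until the next text or start-tag arrives, so an
   element that is closed again before any content appears does not leave
   an empty pair of tags behind.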
122
123   By intention, TagSoup is small and fast. It does not depend on the
124   existence of any framework other than SAX, and should be able to work
125   with any framework that can accept SAX parsers. In particular, [10]XOM
126   is known to work.
127
128   You can replace the low-level HTML scanner with one based on Sean
129   McGrath's [11]PYX format (very close to James Clark's ESIS format). You
130   can also supply an AutoDetector that peeks at the incoming byte stream
131   and guesses a character encoding for it. Otherwise, the platform
132   default is used. If you need an autodetector of character sets,
133   consider trying to adapt the [12]Mozilla one; if you succeed, let me
134   know.
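
   As a rough idea of what "peeking at the incoming byte stream" can
   mean, the sketch below guesses the encoding from a leading byte order
   mark and otherwise falls back to the platform default. It uses only
   the JDK: the class BomSniffer is made up, and the method name
   autoDetectingReader is an assumption about the shape of TagSoup's
   AutoDetector interface, so check the source before adapting it. Real
   detectors such as the Mozilla one look at far more than the BOM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    /** Returns a Reader whose charset is guessed from a byte order mark. */
    public static Reader autoDetectingReader(InputStream in) {
        try {
            PushbackInputStream pb = new PushbackInputStream(in, 3);
            byte[] head = new byte[3];
            int n = 0;                              // bytes actually read
            while (n < head.length) {
                int r = pb.read(head, n, head.length - n);
                if (r < 0) break;                   // end of stream
                n += r;
            }
            Charset charset = Charset.defaultCharset();
            int bomLength = 0;
            if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB
                    && head[2] == (byte) 0xBF) {
                charset = StandardCharsets.UTF_8;
                bomLength = 3;
            } else if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
                charset = StandardCharsets.UTF_16BE;
                bomLength = 2;
            } else if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
                charset = StandardCharsets.UTF_16LE;
                bomLength = 2;
            }
            // Give back everything that was read past the BOM (if any).
            pb.unread(head, bomLength, n - bomLength);
            return new InputStreamReader(pb, charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience helper for trying the detector on a byte array. */
    public static String decode(byte[] bytes) {
        try (Reader r = autoDetectingReader(new ByteArrayInputStream(bytes))) {
            StringBuilder sb = new StringBuilder();
            for (int c = r.read(); c >= 0; c = r.read()) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```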
135
136  Note: TagSoup in Java 1.1
137
138   If you go through the TagSoup source and replace all references to
139   HashMap with Hashtable and recompile, TagSoup will work fine in Java
140   1.1 VMs. Thanks to Thorbjørn Vinne for this discovery.
141
142  The TSaxon XSLT-for-HTML processor
143
144   [13]I am also distributing [14]TSaxon, a repackaging of version 6.5.5
145   of Michael Kay's Saxon XSLT version 1.0 implementation that includes
146   TagSoup. TSaxon is a drop-in replacement for Saxon, and can be used to
147   process either HTML or XML documents with XSLT stylesheets.
148
149  TagSoup as a stand-alone program
150
151   It is possible to run TagSoup as a program by saying java -jar
152   tagsoup-1.2.jar [option ...] [file ...]. Files mentioned on the command
153   line will be parsed individually. If no files are specified, the
154   standard input is read.
155
156   The following options are understood:
157
158   --files
159          Output into individual files, with html extensions changed to
160          xhtml. Otherwise, all output is sent to the standard output.
161
162   --html
163          Output is in clean HTML: the XML declaration is suppressed, as
164          are end-tags for the known empty elements.
165
166   --omit-xml-declaration
167          The XML declaration is suppressed.
168
169   --method=html
170          End-tags for the known empty HTML elements are suppressed.
171
172   --doctype-system=systemid
173          Forces the output of a DOCTYPE declaration with the specified
174          systemid.
175
176   --doctype-public=publicid
177          Forces the output of a DOCTYPE declaration with the specified
178          publicid.
179
180   --version=version
181          Sets the version string in the XML declaration.
182
183   --standalone=[yes|no]
184          Sets the standalone declaration to yes or no.
185
186   --pyx
187          Output is in PYX format.
188
189   --pyxin
190          Input is in PYXoid format (need not be well-formed).
191
192   --nons
193          Namespaces are suppressed. Normally, all elements are in the
194          XHTML 1.x namespace, and all attributes are in no namespace.
195
196   --nobogons
197          Bogons (unknown elements) are suppressed.
198
199   --nodefaults
200          Default attribute values are suppressed.
201
202   --nocolons
203          Explicit colons in element and attribute names are changed to
204          underscores.
205
206   --norestart
207          Normally restartable elements are not restarted.
208
209   --ignorable
210          Whitespace in elements with element-only content is output.
211
212   --emptybogons
213          Bogons are given a content model of EMPTY rather than ANY.
214
215   --any
216          Bogons are given a content model of ANY rather than EMPTY
217          (default).
218
219   --norootbogons
220          Don't allow bogons to be root elements; make them subordinate to
221          the root.
222
223   --lexical
224          Pass through HTML comments and DOCTYPE declarations. Has no
225          effect when output is in PYX format.
226
227   --reuse
228          Reuse a single instance of TagSoup parser throughout. Normally,
229          a new one is instantiated for each input file.
230
231   --nocdata
232          Change the content models of the script and style elements to
233          treat them as ordinary #PCDATA (text-only) elements, as in
234          XHTML, rather than with the special CDATA content model.
235
236   --encoding=encoding
237          Specify the input encoding. The default is the Java platform
238          default.
239
240   --output-encoding=encoding
241          Specify the output encoding. The default is the Java platform
242          default.
243
244   --help
245          Print help.
246
247   --version
248          Print the version number.
249
250  SAX features and properties
251
252   TagSoup supports the following SAX features in addition to the standard
253   ones:
254
255   http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons
256          A value of "true" indicates that the parser will ignore unknown
257          elements.
258
259   http://www.ccil.org/~cowan/tagsoup/features/bogons-empty
260          A value of "true" indicates that the parser will give unknown
261          elements a content model of EMPTY; a value of "false", a content
262          model of ANY.
263
264   http://www.ccil.org/~cowan/tagsoup/features/root-bogons
265          A value of "true" indicates that the parser will allow unknown
266          elements to be the root of the output document.
267
268   http://www.ccil.org/~cowan/tagsoup/features/default-attributes
269          A value of "true" indicates that the parser will return default
270          attribute values for missing attributes that have default
271          values.
272
273   http://www.ccil.org/~cowan/tagsoup/features/translate-colons
274          A value of "true" indicates that the parser will translate
275          colons into underscores in names.
276
277   http://www.ccil.org/~cowan/tagsoup/features/restart-elements
278          A value of "true" indicates that the parser will attempt to
279          restart the restartable elements.
280
281   http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace
282          A value of "true" indicates that the parser will transmit
283          whitespace in element-only content via the SAX
284          ignorableWhitespace callback. Normally this is not done, because
285          HTML is an SGML application and SGML suppresses such whitespace.
286
287   http://www.ccil.org/~cowan/tagsoup/features/cdata-elements
288          A value of "true" indicates that the parser will process the
289          script and style elements (or any elements with type='cdata' in
290          the TSSL schema) as SGML CDATA elements (that is, no markup is
291          recognized except the matching end-tag).
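
   The features above are set through the standard SAX call
   XMLReader.setFeature, using the full URI as the feature name. The
   sketch below is one hedged way to do that: the class TagSoupFeatures
   and its apply method are made up, the parser class name
   org.ccil.cowan.tagsoup.Parser is loaded reflectively so the file
   compiles without the jar, and the pairing of features with
   command-line switches in the comments is an assumption, apart from
   bogons-empty, which the release notes tie to --emptybogons.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class TagSoupFeatures {
    public static final String PREFIX =
            "http://www.ccil.org/~cowan/tagsoup/features/";

    /**
     * Sets each feature (given by its short name) on the reader and
     * returns the full URIs of the features the reader refused.
     */
    public static List<String> apply(XMLReader reader, Map<String, Boolean> features) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : features.entrySet()) {
            String uri = PREFIX + entry.getKey();
            try {
                reader.setFeature(uri, entry.getValue());
            } catch (SAXException e) {
                // SAXNotRecognizedException or SAXNotSupportedException
                rejected.add(uri);
            }
        }
        return rejected;
    }

    public static void main(String[] args) throws Exception {
        // Requires the TagSoup jar on the classpath at run time.
        XMLReader reader = (XMLReader) Class.forName("org.ccil.cowan.tagsoup.Parser")
                .getDeclaredConstructor().newInstance();
        Map<String, Boolean> wanted = new LinkedHashMap<>();
        wanted.put("bogons-empty", true);         // as with --emptybogons
        wanted.put("default-attributes", false);  // presumably like --nodefaults
        wanted.put("restart-elements", false);    // presumably like --norestart
        System.out.println("Rejected: " + apply(reader, wanted));
    }
}
```

   Collecting the rejected URIs instead of failing on the first one keeps
   the same configuration code usable with readers that know only some of
   these features.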
292
293   TagSoup supports the following SAX properties in addition to the
294   standard ones:
295
296   http://www.ccil.org/~cowan/tagsoup/properties/scanner
297          Specifies the Scanner object this parser uses.
298
299   http://www.ccil.org/~cowan/tagsoup/properties/schema
300          Specifies the Schema object this parser uses.
301
302   http://www.ccil.org/~cowan/tagsoup/properties/auto-detector
303          Specifies the AutoDetector (for encoding detection) this parser
304          uses.
305
306  More information
307
308   I gave a presentation (a nocturne, so it's not on the schedule) at
309   [15]Extreme Markup Languages 2004 about TagSoup, updated from the one
310   presented in 2002 at the New York City XML SIG and at XML 2002. This is
311   the main high-level documentation about how TagSoup works. Formats:
312   [16]OpenDocument [17]Powerpoint [18]PDF.
313
314   I also had people add [19]"evil" HTML to a large poster so that I could
315   [20]clean it up; View Source is probably more useful than ordinary
316   browsing. The original instructions were:
317
318                         SOUPE DE BALISES (BE EVIL)!
319   Ecritez une balise ouvrante (sans attributs)
320   ou fermante HTML ici, s.v.p.
   (In English: "Tag soup (be evil)! Please write an HTML start-tag
   (without attributes) or end-tag here.")
321
322   There is a [21]tagsoup-friends mailing list hosted at [22]Yahoo Groups.
323   You can [23]join via the Web, or by sending a blank email to
324   [24]tagsoup-friends-subscribe@yahoogroups.com. The [25]archives are
325   open to all.
326
327   Online TagSoup processing for publicly accessible HTML documents is now
328   [26]available courtesy of Leigh Dodds.
329
330References
331
332   1. http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html
333   2. http://opensource.org/licenses/apache2.0.php
334   3. http://prdownloads.sourceforge.net/saxon/saxon6-5-5.zip
335   4. http://www.w3.org/TR/2007/WD-xml-entity-names-20071214
336   5. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2.jar
337   6. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2-src.zip
338   7. http://home.ccil.org/~cowan/XML/tagsoup/CHANGES
339   8. http://tidy.sf.net/
340   9. http://www.crumbmuseum.com/truckin.html
341  10. http://www.cafeconleche.org/XOM
342  11. http://gnosis.cx/publish/programming/xml_matters_17.html
343  12. http://jchardet.sourceforge.net/
344  13. http://www.ccil.org/~cowan
345  14. http://home.ccil.org/~cowan/XML/tagsoup/tsaxon
346  15. http://www.extrememarkup.com/extreme/2004
347  16. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.odp
348  17. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.ppt
349  18. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.pdf
350  19. http://home.ccil.org/~cowan/XML/tagsoup/extreme.html
351  20. http://home.ccil.org/~cowan/XML/tagsoup/extreme.xhtml
352  21. http://groups.yahoo.com/group/tagsoup-friends
353  22. http://groups.yahoo.com/
354  23. http://groups.yahoo.com/group/tagsoup-friends/join
355  24. mailto:tagsoup-friends-subscribe@yahoogroups.com
356  25. http://groups.yahoo.com/group/tagsoup-friends/messages
357  26. http://xmlarmyknife.org/docs/xhtml/tagsoup/
358