tokenization.txt - OpenGrok cross reference for /external/owasp/sanitizer/lib/htmlparser-1.3/doc/tokenization.txt

Lines Matching full:the
14    Implementations must act as if they used the following state machine to
15    tokenise HTML. The state machine must start in the data state. Most
17    and either switches the state machine to a new state to reconsume the
18    same character, or switches it to a new state (to consume the next
19    character), or repeats the same state (to consume the next character).
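
The three kinds of transition described above (reconsume in a new state, consume next in a new state, repeat the same state) can be sketched as a small driver loop. This is only an illustrative sketch: the handler names, the (state, reconsume) return convention, and the use of None for EOF are assumptions made for the example and are not taken from the specification.

```python
from typing import List, Optional, Tuple

def data_state(ch: Optional[str], out: List[str]) -> Tuple[str, bool]:
    if ch == "<":
        return "tag_open", False   # switch state, consume the next character
    if ch is not None:
        out.append(ch)
    return "data", False           # repeat the same state, consume the next character

def tag_open_state(ch: Optional[str], out: List[str]) -> Tuple[str, bool]:
    # Heavily simplified: emit "<" as text and hand the same character
    # back to the data state, i.e. reconsume it there.
    out.append("<")
    return "data", True

STATES = {"data": data_state, "tag_open": tag_open_state}

def run(text: str) -> str:
    out: List[str] = []
    state = "data"                 # the machine starts in the data state
    pos = 0
    while pos <= len(text):        # position len(text) stands for EOF (None)
        ch = text[pos] if pos < len(text) else None
        state, reconsume = STATES[state](ch, out)
        if not reconsume:
            pos += 1               # only advance when the character was consumed
    return "".join(out)
```

The driver advances the input position only when a handler does not ask for the character to be reconsumed, which is what distinguishes the first kind of transition from the other two.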
23    The exact behavior of certain states depends on a content model flag
24    that is set after certain tokens are emitted. The flag has several
26    the PCDATA state. In the RCDATA and CDATA states, a further escape flag
27    is used to control the behavior of the tokeniser. It is either true or
28    false, and initially must be set to the false state. The insertion mode
29    and the stack of open elements also affect tokenization.
31    The output of the tokenization step is a series of zero or more of the
36    missing (which is a distinct state from the empty string), and the
44    When a token is emitted, it must immediately be handled by the tree
45    construction stage. The tree construction stage can affect the state of
46    the content model flag, and can insert additional characters into the
47    stream. (For example, the script element can result in scripts
48    executing and using the dynamic markup insertion APIs to insert
49    characters into the stream being tokenised.)
52    the flag is not acknowledged when it is processed by the tree
55    When an end tag token is emitted, the content model flag must be
56    switched to the PCDATA state.
64    Before each step of the tokeniser, the user agent must first check the
65    parser pause flag. If it is true, then the tokeniser must abort the
66    processing of any nested invocations of the tokeniser, yielding control
67    back to the caller. If it is false, then the user agent may then check
68    to see if either one of the scripts in the list of scripts that will
69    execute as soon as possible or the first script in the list of scripts
73    The tokeniser state machine consists of the states defined in the
78    Consume the next input character:
81           When the content model flag is set to one of the PCDATA or
82           RCDATA states and the escape flag is false: switch to the
84           Otherwise: treat it as per the "anything else" entry below.
87           If the content model flag is set to either the RCDATA state or
88           the CDATA state, and the escape flag is false, and there are at
89           least three characters before this one in the input stream, and
90           the last four characters in the input stream, including this
92           HYPHEN-MINUS, and U+002D HYPHEN-MINUS ("<!--"), then set the
95           In any case, emit the input character as a character token. Stay
96           in the data state.
99           When the content model flag is set to the PCDATA state: switch
100           to the tag open state.
101           When the content model flag is set to either the RCDATA state or
102           the CDATA state, and the escape flag is false: switch to the tag
104           Otherwise: treat it as per the "anything else" entry below.
107           If the content model flag is set to either the RCDATA state or
108           the CDATA state, and the escape flag is true, and the last three
109           characters in the input stream including this one are U+002D
111           ("-->"), set the escape flag to false.
113           In any case, emit the input character as a character token. Stay
114           in the data state.
120           Emit the input character as a character token. Stay in the data
125    (This cannot happen if the content model flag is set to the CDATA
133    Otherwise, emit the character token that was returned.
135    Finally, switch to the data state.
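
The escape flag handling in the data state above (set on "<!--", cleared on "-->", only in the RCDATA and CDATA content models) might be sketched as follows. The function names and the string-valued content_model parameter are hypothetical and chosen only for this illustration.

```python
from typing import List

def update_escape_flag(consumed: str, escape: bool, content_model: str) -> bool:
    """Return the escape flag after the last character of `consumed` is read.

    `consumed` is the input seen so far, including the current character.
    `content_model` is one of "PCDATA", "RCDATA" or "CDATA" (assumed names).
    """
    if content_model not in ("RCDATA", "CDATA"):
        return escape
    if not escape and consumed.endswith("<!--"):
        return True                # the last four characters are "<!--"
    if escape and consumed.endswith("-->"):
        return False               # the last three characters are "-->"
    return escape

def escape_flags(text: str, content_model: str = "RCDATA") -> List[bool]:
    """Flag value after each character, starting from false."""
    escape = False
    flags = []
    for i in range(len(text)):
        escape = update_escape_flag(text[: i + 1], escape, content_model)
        flags.append(escape)
    return flags
```

Checking with endswith("<!--") also covers the requirement that at least three characters precede the current one.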
139    The behavior of this state depends on the content model flag.
141    If the content model flag is set to the RCDATA or CDATA states
142           Consume the next input character. If it is a U+002F SOLIDUS (/)
143           character, switch to the close tag open state. Otherwise, emit a
144           U+003C LESS-THAN SIGN character token and reconsume the current
145           input character in the data state.
147    If the content model flag is set to the PCDATA state
148           Consume the next input character:
151                 Switch to the markup declaration open state.
154                 Switch to the close tag open state.
158                 Create a new start tag token, set its tag name to the
159                 lowercase version of the input character (add 0x0020 to
160                 the character's code point), then switch to the tag name
161                 state. (Don't emit the token yet; further details will be
165                 Create a new start tag token, set its tag name to the
166                 input character, then switch to the tag name state. (Don't
167                 emit the token yet; further details will be filled in
173                 the data state.
176                 Parse error. Switch to the bogus comment state.
180                 and reconsume the current input character in the data
185    If the content model flag is set to the RCDATA or CDATA states but no
186    start tag token has ever been emitted by this instance of the tokeniser
187    (fragment case), or, if the content model flag is set to the RCDATA or
188    CDATA states and the next few characters do not match the tag name of
189    the last start tag token emitted (compared in an ASCII case-insensitive
191    the following characters:
201    character token, and switch to the data state to process the next input
204    Otherwise, if the content model flag is set to the PCDATA state, or if
205    the next few characters do match that tag name, consume the next input
209           Create a new end tag token, set its tag name to the lowercase
210           version of the input character (add 0x0020 to the character's
211           code point), then switch to the tag name state. (Don't emit the
216           Create a new end tag token, set its tag name to the input
217           character, then switch to the tag name state. (Don't emit the
222           Parse error. Switch to the data state.
226           U+002F SOLIDUS character token. Reconsume the EOF character in
227           the data state.
230           Parse error. Switch to the bogus comment state.
234    Consume the next input character:
240           Switch to the before attribute name state.
243           Switch to the self-closing start tag state.
246           Emit the current tag token. Switch to the data state.
249           Append the lowercase version of the current input character (add
250           0x0020 to the character's code point) to the current tag token's
251           tag name. Stay in the tag name state.
254           Parse error. Emit the current tag token. Reconsume the EOF
255           character in the data state.
258           Append the current input character to the current tag token's
259           tag name. Stay in the tag name state.
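
The lowercasing rule used in the tag open and tag name states (add 0x0020 to the code point of an uppercase ASCII letter) can be sketched as below. The set of characters that end a tag name is an assumption for this example, since those entries are not visible in the lines above.

```python
from typing import Tuple

# Assumed delimiters for the sketch: tab, line feed, form feed, space, "/" and ">".
TAG_NAME_END = "\t\n\x0c />"

def lowercase_ascii(ch: str) -> str:
    """Lowercase A-Z by adding 0x0020 to the code point; leave the rest alone."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 0x0020)
    return ch

def read_tag_name(text: str, pos: int = 0) -> Tuple[str, int]:
    """Accumulate a tag name from `pos`; return (name, index where it stopped)."""
    name = []
    while pos < len(text) and text[pos] not in TAG_NAME_END:
        name.append(lowercase_ascii(text[pos]))
        pos += 1
    return "".join(name), pos
```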
263    Consume the next input character:
269           Stay in the before attribute name state.
272           Switch to the self-closing start tag state.
275           Emit the current tag token. Switch to the data state.
278           Start a new attribute in the current tag token. Set that
279           attribute's name to the lowercase version of the current input
280           character (add 0x0020 to the character's code point), and its
281           value to the empty string. Switch to the attribute name state.
286           Parse error. Treat it as per the "anything else" entry below.
289           Parse error. Emit the current tag token. Reconsume the EOF
290           character in the data state.
293           Start a new attribute in the current tag token. Set that
294           attribute's name to the current input character, and its value
295           to the empty string. Switch to the attribute name state.
299    Consume the next input character:
305           Switch to the after attribute name state.
308           Switch to the self-closing start tag state.
311           Switch to the before attribute value state.
314           Emit the current tag token. Switch to the data state.
317           Append the lowercase version of the current input character (add
318           0x0020 to the character's code point) to the current attribute's
319           name. Stay in the attribute name state.
323           Parse error. Treat it as per the "anything else" entry below.
326           Parse error. Emit the current tag token. Reconsume the EOF
327           character in the data state.
330           Append the current input character to the current attribute's
331           name. Stay in the attribute name state.
333    When the user agent leaves the attribute name state (and before
334    emitting the tag token, if appropriate), the complete attribute's name
335    must be compared to the other attributes on the same token; if there is
336    already an attribute on the token with the exact same name, then this
337    is a parse error and the new attribute must be dropped, along with the
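
As a rough illustration of the duplicate check described above, an attribute list could be maintained like this. The list-of-pairs representation and the boolean return value standing in for "parse error" are assumptions of the sketch.

```python
from typing import List, Tuple

def add_attribute(attributes: List[Tuple[str, str]], name: str, value: str) -> bool:
    """Append (name, value) unless the token already has an attribute `name`.

    Returns False when the new attribute is dropped, which corresponds to
    the parse error case; the first occurrence is kept unchanged.
    """
    if any(existing == name for existing, _ in attributes):
        return False
    attributes.append((name, value))
    return True
```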
342    Consume the next input character:
348           Stay in the after attribute name state.
351           Switch to the self-closing start tag state.
354           Switch to the before attribute value state.
357           Emit the current tag token. Switch to the data state.
360           Start a new attribute in the current tag token. Set that
361           attribute's name to the lowercase version of the current input
362           character (add 0x0020 to the character's code point), and its
363           value to the empty string. Switch to the attribute name state.
367           Parse error. Treat it as per the "anything else" entry below.
370           Parse error. Emit the current tag token. Reconsume the EOF
371           character in the data state.
374           Start a new attribute in the current tag token. Set that
375           attribute's name to the current input character, and its value
376           to the empty string. Switch to the attribute name state.
380    Consume the next input character:
386           Stay in the before attribute value state.
389           Switch to the attribute value (double-quoted) state.
392           Switch to the attribute value (unquoted) state and reconsume
396           Switch to the attribute value (single-quoted) state.
399           Parse error. Emit the current tag token. Switch to the data
403           Parse error. Treat it as per the "anything else" entry below.
406           Parse error. Emit the current tag token. Reconsume the character
407           in the data state.
410           Append the current input character to the current attribute's
411           value. Switch to the attribute value (unquoted) state.
415    Consume the next input character:
418           Switch to the after attribute value (quoted) state.
421           Switch to the character reference in attribute value state, with
422           the additional allowed character being U+0022 QUOTATION MARK
426           Parse error. Emit the current tag token. Reconsume the character
427           in the data state.
430           Append the current input character to the current attribute's
431           value. Stay in the attribute value (double-quoted) state.
435    Consume the next input character:
438           Switch to the after attribute value (quoted) state.
441           Switch to the character reference in attribute value state, with
442           the additional allowed character being U+0027 APOSTROPHE (').
445           Parse error. Emit the current tag token. Reconsume the character
446           in the data state.
449           Append the current input character to the current attribute's
450           value. Stay in the attribute value (single-quoted) state.
454    Consume the next input character:
460           Switch to the before attribute name state.
463           Switch to the character reference in attribute value state, with
467           Emit the current tag token. Switch to the data state.
472           Parse error. Treat it as per the "anything else" entry below.
475           Parse error. Emit the current tag token. Reconsume the character
476           in the data state.
479           Append the current input character to the current attribute's
480           value. Stay in the attribute value (unquoted) state.
486    If nothing is returned, append a U+0026 AMPERSAND character to the
489    Otherwise, append the returned character token to the current
492    Finally, switch back to the attribute value state that you were in when
497    Consume the next input character:
503           Switch to the before attribute name state.
506           Switch to the self-closing start tag state.
509           Emit the current tag token. Switch to the data state.
512           Parse error. Emit the current tag token. Reconsume the EOF
513           character in the data state.
516           Parse error. Reconsume the character in the before attribute
521    Consume the next input character:
524           Set the self-closing flag of the current tag token. Emit the
525           current tag token. Switch to the data state.
528           Parse error. Emit the current tag token. Reconsume the EOF
529           character in the data state.
532           Parse error. Reconsume the character in the before attribute
537    (This can only happen if the content model flag is set to the PCDATA
540    Consume every character up to and including the first U+003E
541    GREATER-THAN SIGN character (>) or the end of the file (EOF), whichever
542    comes first. Emit a comment token whose data is the concatenation of
543    all the characters starting from and including the character that
544    caused the state machine to switch into the bogus comment state, up to
545    and including the character immediately before the last consumed
546    character (i.e. up to the character just before the U+003E or EOF
547    character). (If the comment was started by the end of the file (EOF),
548    the token is empty.)
550    Switch to the data state.
552    If the end of the file was reached, reconsume the EOF character.
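
The bogus comment state above amounts to a scan for the first ">" or EOF. A minimal sketch, assuming the input is held in a string and that `start` is the index of the character that caused the switch into this state (both assumptions of the example):

```python
from typing import Tuple

def bogus_comment(text: str, start: int) -> Tuple[str, int, bool]:
    """Return (comment data, next position, whether EOF was reached)."""
    end = text.find(">", start)
    if end == -1:
        # No ">" left: the comment runs to EOF, which is then reconsumed.
        return text[start:], len(text), True
    # The ">" itself is consumed but is not part of the comment data.
    return text[start:end], end + 1, False
```

When `start` equals the length of the input, the comment was started by EOF and the returned data is empty, matching the parenthetical note above.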
556    (This can only happen if the content model flag is set to the PCDATA
559    If the next two characters are both U+002D HYPHEN-MINUS (-) characters,
560    consume those two characters, create a comment token whose data is the
561    empty string, and switch to the comment start state.
563    Otherwise, if the next seven characters are an ASCII case-insensitive
564    match for the word "DOCTYPE", then consume those characters and switch
565    to the DOCTYPE state.
567    Otherwise, if the insertion mode is "in foreign content" and the
568    current node is not an element in the HTML namespace and the next seven
569    characters are an ASCII case-sensitive match for the string "[CDATA["
570    (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET
572    to the CDATA section state (which is unrelated to the content model
575    Otherwise, this is a parse error. Switch to the bogus comment state.
576    The next character that is consumed, if any, is the first character
577    that will be in the comment.
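
The lookahead performed in the markup declaration open state can be sketched as below. The single boolean `in_foreign_content` is a simplification that stands in for both conditions on the CDATA branch (insertion mode and current node namespace); it and the returned state names are assumptions of this example.

```python
from typing import Tuple

def markup_declaration_open(text: str, pos: int,
                            in_foreign_content: bool = False) -> Tuple[str, int]:
    """Return (next state, next position) for input starting at `pos`."""
    rest = text[pos:]
    if rest.startswith("--"):
        return "comment start", pos + 2
    if rest[:7].lower() == "doctype":          # case-insensitive match
        return "DOCTYPE", pos + 7
    if in_foreign_content and rest.startswith("[CDATA["):   # case-sensitive match
        return "CDATA section", pos + 7
    # Parse error: nothing is consumed, so the next character consumed
    # becomes the first character of the bogus comment.
    return "bogus comment", pos
```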
581    Consume the next input character:
584           Switch to the comment start dash state.
587           Parse error. Emit the comment token. Switch to the data state.
590           Parse error. Emit the comment token. Reconsume the EOF character
591           in the data state.
594           Append the input character to the comment token's data. Switch
595           to the comment state.
599    Consume the next input character:
602           Switch to the comment end state.
605           Parse error. Emit the comment token. Switch to the data state.
608           Parse error. Emit the comment token. Reconsume the EOF character
609           in the data state.
612           Append a U+002D HYPHEN-MINUS (-) character and the input
613           character to the comment token's data. Switch to the comment
618    Consume the next input character:
621           Switch to the comment end dash state.
624           Parse error. Emit the comment token. Reconsume the EOF character
625           in the data state.
628           Append the input character to the comment token's data. Stay in
629           the comment state.
633    Consume the next input character:
636           Switch to the comment end state.
639           Parse error. Emit the comment token. Reconsume the EOF character
640           in the data state.
643           Append a U+002D HYPHEN-MINUS (-) character and the input
644           character to the comment token's data. Switch to the comment
649    Consume the next input character:
652           Emit the comment token. Switch to the data state.
655           Parse error. Append a U+002D HYPHEN-MINUS (-) character to the
656           comment token's data. Stay in the comment end state.
659           Parse error. Emit the comment token. Reconsume the EOF character
660           in the data state.
664           the input character to the comment token's data. Switch to the
669    Consume the next input character:
675           Switch to the before DOCTYPE name state.
678           Parse error. Reconsume the current character in the before
683    Consume the next input character:
689           Stay in the before DOCTYPE name state.
693           flag to on. Emit the token. Switch to the data state.
696           Create a new DOCTYPE token. Set the token's name to the
697           lowercase version of the input character (add 0x0020 to the
698           character's code point). Switch to the DOCTYPE name state.
702           flag to on. Emit the token. Reconsume the EOF character in the
706           Create a new DOCTYPE token. Set the token's name to the current
707           input character. Switch to the DOCTYPE name state.
711    Consume the next input character:
717           Switch to the after DOCTYPE name state.
720           Emit the current DOCTYPE token. Switch to the data state.
723           Append the lowercase version of the input character (add 0x0020
724           to the character's code point) to the current DOCTYPE token's
725           name. Stay in the DOCTYPE name state.
728           Parse error. Set the DOCTYPE token's force-quirks flag to on.
729           Emit that DOCTYPE token. Reconsume the EOF character in the data
733           Append the current input character to the current DOCTYPE
734           token's name. Stay in the DOCTYPE name state.
738    Consume the next input character:
744           Stay in the after DOCTYPE name state.
747           Emit the current DOCTYPE token. Switch to the data state.
750           Parse error. Set the DOCTYPE token's force-quirks flag to on.
751           Emit that DOCTYPE token. Reconsume the EOF character in the data
755           If the six characters starting from the current input character
756           are an ASCII case-insensitive match for the word "PUBLIC", then
757           consume those characters and switch to the before DOCTYPE public
760           Otherwise, if the six characters starting from the current input
761           character are an ASCII case-insensitive match for the word
762           "SYSTEM", then consume those characters and switch to the before
765           Otherwise, this is a parse error. Set the DOCTYPE token's
766           force-quirks flag to on. Switch to the bogus DOCTYPE state.
770    Consume the next input character:
776           Stay in the before DOCTYPE public identifier state.
779           Set the DOCTYPE token's public identifier to the empty string
780           (not missing), then switch to the DOCTYPE public identifier
784           Set the DOCTYPE token's public identifier to the empty string
785           (not missing), then switch to the DOCTYPE public identifier
789           Parse error. Set the DOCTYPE token's force-quirks flag to on.
790           Emit that DOCTYPE token. Switch to the data state.
793           Parse error. Set the DOCTYPE token's force-quirks flag to on.
794           Emit that DOCTYPE token. Reconsume the EOF character in the data
798           Parse error. Set the DOCTYPE token's force-quirks flag to on.
799           Switch to the bogus DOCTYPE state.
803    Consume the next input character:
806           Switch to the after DOCTYPE public identifier state.
809           Parse error. Set the DOCTYPE token's force-quirks flag to on.
810           Emit that DOCTYPE token. Switch to the data state.
813           Parse error. Set the DOCTYPE token's force-quirks flag to on.
814           Emit that DOCTYPE token. Reconsume the EOF character in the data
818           Append the current input character to the current DOCTYPE
819           token's public identifier. Stay in the DOCTYPE public identifier
824    Consume the next input character:
827           Switch to the after DOCTYPE public identifier state.
830           Parse error. Set the DOCTYPE token's force-quirks flag to on.
831           Emit that DOCTYPE token. Switch to the data state.
834           Parse error. Set the DOCTYPE token's force-quirks flag to on.
835           Emit that DOCTYPE token. Reconsume the EOF character in the data
839           Append the current input character to the current DOCTYPE
840           token's public identifier. Stay in the DOCTYPE public identifier
845    Consume the next input character:
851           Stay in the after DOCTYPE public identifier state.
854           Set the DOCTYPE token's system identifier to the empty string
855           (not missing), then switch to the DOCTYPE system identifier
859           Set the DOCTYPE token's system identifier to the empty string
860           (not missing), then switch to the DOCTYPE system identifier
864           Emit the current DOCTYPE token. Switch to the data state.
867           Parse error. Set the DOCTYPE token's force-quirks flag to on.
868           Emit that DOCTYPE token. Reconsume the EOF character in the data
872           Parse error. Set the DOCTYPE token's force-quirks flag to on.
873           Switch to the bogus DOCTYPE state.
877    Consume the next input character:
883           Stay in the before DOCTYPE system identifier state.
886           Set the DOCTYPE token's system identifier to the empty string
887           (not missing), then switch to the DOCTYPE system identifier
891           Set the DOCTYPE token's system identifier to the empty string
892           (not missing), then switch to the DOCTYPE system identifier
896           Parse error. Set the DOCTYPE token's force-quirks flag to on.
897           Emit that DOCTYPE token. Switch to the data state.
900           Parse error. Set the DOCTYPE token's force-quirks flag to on.
901           Emit that DOCTYPE token. Reconsume the EOF character in the data
905           Parse error. Set the DOCTYPE token's force-quirks flag to on.
906           Switch to the bogus DOCTYPE state.
910    Consume the next input character:
913           Switch to the after DOCTYPE system identifier state.
916           Parse error. Set the DOCTYPE token's force-quirks flag to on.
917           Emit that DOCTYPE token. Switch to the data state.
920           Parse error. Set the DOCTYPE token's force-quirks flag to on.
921           Emit that DOCTYPE token. Reconsume the EOF character in the data
925           Append the current input character to the current DOCTYPE
926           token's system identifier. Stay in the DOCTYPE system identifier
931    Consume the next input character:
934           Switch to the after DOCTYPE system identifier state.
937           Parse error. Set the DOCTYPE token's force-quirks flag to on.
938           Emit that DOCTYPE token. Switch to the data state.
941           Parse error. Set the DOCTYPE token's force-quirks flag to on.
942           Emit that DOCTYPE token. Reconsume the EOF character in the data
946           Append the current input character to the current DOCTYPE
947           token's system identifier. Stay in the DOCTYPE system identifier
952    Consume the next input character:
958           Stay in the after DOCTYPE system identifier state.
961           Emit the current DOCTYPE token. Switch to the data state.
964           Parse error. Set the DOCTYPE token's force-quirks flag to on.
965           Emit that DOCTYPE token. Reconsume the EOF character in the data
969           Parse error. Switch to the bogus DOCTYPE state. (This does not
970           set the DOCTYPE token's force-quirks flag to on.)
974    Consume the next input character:
977           Emit the DOCTYPE token. Switch to the data state.
980           Emit the DOCTYPE token. Reconsume the EOF character in the data
984           Stay in the bogus DOCTYPE state.
988    (This can only happen if the content model flag is set to the PCDATA
989    state, and is unrelated to the content model flag's CDATA state.)
991    Consume every character up to the next occurrence of the three
993    BRACKET U+003E GREATER-THAN SIGN (]]>), or the end of the file (EOF),
995    all the characters consumed except the matching three character
996    sequence at the end (if one was found before the end of the file).
998    Switch to the data state.
1000    If the end of the file was reached, reconsume the EOF character.
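
The CDATA section state above is a scan for the three-character sequence "]]>". A minimal sketch, assuming string input and a hypothetical return shape of (text, next position, EOF reached):

```python
from typing import Tuple

def cdata_section(text: str, pos: int) -> Tuple[str, int, bool]:
    """Collect characters from `pos` up to the next "]]>" or EOF."""
    end = text.find("]]>", pos)
    if end == -1:
        # No terminator: everything up to EOF is character data.
        return text[pos:], len(text), True
    # The "]]>" sequence is consumed but not emitted.
    return text[pos:end], end + 3, False
```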
1008    The behavior depends on the identity of the next character (the one
1009    immediately after the U+0026 AMPERSAND character):
1018    The additional allowed character, if there is one
1023           Consume the U+0023 NUMBER SIGN.
1025           The behavior further depends on the character after the U+0023
1030                 Consume the X.
1032                 Follow the steps below, but using the range of characters
1038                 When it comes to interpreting the number, interpret it as
1042                 Follow the steps below, but using the range of characters
1046                 When it comes to interpreting the number, interpret it as
1049           Consume as many characters as match the range of characters
1052           If no characters match the range, then don't consume any
1053           characters (and unconsume the U+0023 NUMBER SIGN character and,
1054           if appropriate, the X character). This is a parse error; nothing
1057           Otherwise, if the next character is a U+003B SEMICOLON, consume
1060           If one or more characters match the range, then take them all
1061           and interpret the string of characters as a number (either
1064           If that number is one of the numbers in the first column of the
1065           following table, then this is a parse error. Find the row with
1066           that number in the first column, and return a character token
1067           for the Unicode character given in the second column of that
1105           Otherwise, if the number is in the range 0x0000 to 0x0008,
1113           a parse error; return a character token for the U+FFFD
1116           Otherwise, return a character token for the Unicode character
1120           Consume the maximum number of characters possible, with the
1121           consumed characters matching one of the identifiers in the first
1122           column of the named character references table (in a
1128           If the last character matched is not a U+003B SEMICOLON (;),
1131           If the character reference is being consumed as part of an
1132           attribute, and the last character matched is not a U+003B
1133           SEMICOLON (;), and the next character is in the range U+0030
1137           all the characters that were matched after the U+0026 AMPERSAND
1140           Otherwise, return a character token for the character
1141           corresponding to the character reference name (as given by the
1142           second column of the named character references table).
1144           If the markup contains I'm &notit; I tell you, the character
1146           the markup was I'm &notin; I tell you, the character reference
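
The "maximum number of characters" rule for named references can be illustrated with a longest-match lookup. The three-entry table is a tiny excerpt chosen for the example, and the function name and return shape are hypothetical. The sketch ignores the parse error reporting and the extra attribute-value rule described above.

```python
from typing import Optional, Tuple

# Tiny excerpt of a named character references table, for illustration only.
NAMED = {"not": "\u00ac", "not;": "\u00ac", "notin;": "\u2209"}

def match_named_reference(text: str, pos: int) -> Tuple[Optional[str], int]:
    """`pos` is the index just after "&". Return (character, next position),
    or (None, pos) when no identifier matches."""
    best = None
    for name in NAMED:
        if text.startswith(name, pos) and (best is None or len(name) > len(best)):
            best = name            # keep the longest identifier that matches
    if best is None:
        return None, pos
    return NAMED[best], pos + len(best)
```

With this table, "&notit;" matches only "not", while "&notin;" matches the longer "notin;".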