1 // Protocol Buffers - Google's data interchange format
2 // Copyright 2008 Google Inc.  All rights reserved.
3 // https://developers.google.com/protocol-buffers/
4 //
5 // Redistribution and use in source and binary forms, with or without
6 // modification, are permitted provided that the following conditions are
7 // met:
8 //
9 //     * Redistributions of source code must retain the above copyright
10 // notice, this list of conditions and the following disclaimer.
11 //     * Redistributions in binary form must reproduce the above
12 // copyright notice, this list of conditions and the following disclaimer
13 // in the documentation and/or other materials provided with the
14 // distribution.
15 //     * Neither the name of Google Inc. nor the names of its
16 // contributors may be used to endorse or promote products derived from
17 // this software without specific prior written permission.
18 //
19 // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
20 // "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
21 // LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
22 // A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
23 // OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
24 // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
25 // LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
26 // DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
27 // THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
28 // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 
31 // Author: kenton@google.com (Kenton Varda)
32 //  Based on original Protocol Buffers design by
33 //  Sanjay Ghemawat, Jeff Dean, and others.
34 //
35 // Here we have a hand-written lexer.  At first you might ask yourself,
36 // "Hand-written text processing?  Is Kenton crazy?!"  Well, first of all,
37 // yes I am crazy, but that's beside the point.  There are actually reasons
38 // why I ended up writing it this way.
39 //
40 // The traditional approach to lexing is to use lex to generate a lexer for
41 // you.  Unfortunately, lex's output is ridiculously ugly and difficult to
42 // integrate cleanly with C++ code, especially abstract code or code meant
43 // as a library.  Better parser-generators exist but would add dependencies
44 // which most users won't already have, which we'd like to avoid.  (GNU flex
45 // has a C++ output option, but it's still ridiculously ugly, non-abstract,
46 // and not library-friendly.)
47 //
48 // The next approach that any good software engineer should look at is to
49 // use regular expressions.  And, indeed, I did.  I have code which
50 // implements this same class using regular expressions.  It's about 200
51 // lines shorter.  However:
52 // - Rather than error messages telling you "This string has an invalid
53 //   escape sequence at line 5, column 45", you get error messages like
54 //   "Parse error on line 5".  Giving more precise errors requires adding
55 //   a lot of code that ends up basically as complex as the hand-coded
56 //   version anyway.
57 // - The regular expression to match a string literal looks like this:
58 //     kString  = new RE("(\"([^\"\\\\]|"              // non-escaped
59 //                       "\\\\[abfnrtv?\"'\\\\0-7]|"   // normal escape
60 //                       "\\\\x[0-9a-fA-F])*\"|"       // hex escape
61 //                       "\'([^\'\\\\]|"        // Also support single-quotes.
62 //                       "\\\\[abfnrtv?\"'\\\\0-7]|"
63 //                       "\\\\x[0-9a-fA-F])*\')");
64 //   Verifying the correctness of this line noise is actually harder than
65 //   verifying the correctness of ConsumeString(), defined below.  I'm not
66 //   even confident that the above is correct, after staring at it for some
67 //   time.
68 // - PCRE is fast, but there's still more overhead involved than the code
69 //   below.
70 // - Sadly, regular expressions are not part of the C standard library, so
71 //   using them would require depending on some other library.  For the
72 //   open source release, this could be really annoying.  Nobody likes
73 //   downloading one piece of software just to find that they need to
74 //   download something else to make it work, and in all likelihood
75 //   people downloading Protocol Buffers will already be doing so just
76 //   to make something else work.  We could include a copy of PCRE with
77 //   our code, but that obligates us to keep it up-to-date and just seems
78 //   like a big waste just to save 200 lines of code.
79 //
80 // On a similar but unrelated note, I'm even scared to use ctype.h.
81 // Apparently functions like isalpha() are locale-dependent.  So, if we used
82 // that, then if this code is being called from some program that doesn't
83 // have its locale set to "C", it would behave strangely.  We can't just set
84 // the locale to "C" ourselves since we might break the calling program that
85 // way, particularly if it is multi-threaded.  WTF?  Someone please let me
86 // (Kenton) know if I'm missing something here...
87 //
88 // I'd love to hear about other alternatives, though, as this code isn't
89 // exactly pretty.
90 
91 #include <google/protobuf/io/tokenizer.h>
92 
93 #include <google/protobuf/stubs/common.h>
94 #include <google/protobuf/stubs/logging.h>
95 #include <google/protobuf/stubs/strutil.h>
96 #include <google/protobuf/stubs/stringprintf.h>
97 #include <google/protobuf/io/strtod.h>
98 #include <google/protobuf/io/zero_copy_stream.h>
99 #include <google/protobuf/stubs/stl_util.h>
100 
101 namespace google {
102 namespace protobuf {
103 namespace io {
104 namespace {
105 
106 // As mentioned above, I don't trust ctype.h due to the presence of "locales".
107 // So, I have written replacement functions here.  Someone please smack me if
108 // this is a bad idea or if there is some way around this.
109 //
110 // These "character classes" are designed to be used in template methods.
111 // For instance, Tokenizer::ConsumeZeroOrMore<Whitespace>() will eat
112 // whitespace.
113 
114 // Note:  No class is allowed to contain '\0', since this is used to mark end-
115 //   of-input and is handled specially.
116 
117 #define CHARACTER_CLASS(NAME, EXPRESSION)                     \
118   class NAME {                                                \
119    public:                                                    \
120     static inline bool InClass(char c) { return EXPRESSION; } \
121   }
122 
123 CHARACTER_CLASS(Whitespace, c == ' ' || c == '\n' || c == '\t' || c == '\r' ||
124                                 c == '\v' || c == '\f');
125 CHARACTER_CLASS(WhitespaceNoNewline,
126                 c == ' ' || c == '\t' || c == '\r' || c == '\v' || c == '\f');
127 
128 CHARACTER_CLASS(Unprintable, c < ' ' && c > '\0');
129 
130 CHARACTER_CLASS(Digit, '0' <= c && c <= '9');
131 CHARACTER_CLASS(OctalDigit, '0' <= c && c <= '7');
132 CHARACTER_CLASS(HexDigit, ('0' <= c && c <= '9') || ('a' <= c && c <= 'f') ||
133                               ('A' <= c && c <= 'F'));
134 
135 CHARACTER_CLASS(Letter,
136                 ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') || (c == '_'));
137 
138 CHARACTER_CLASS(Alphanumeric, ('a' <= c && c <= 'z') ||
139                                   ('A' <= c && c <= 'Z') ||
140                                   ('0' <= c && c <= '9') || (c == '_'));
141 
142 CHARACTER_CLASS(Escape, c == 'a' || c == 'b' || c == 'f' || c == 'n' ||
143                             c == 'r' || c == 't' || c == 'v' || c == '\\' ||
144                             c == '?' || c == '\'' || c == '\"');
145 
146 #undef CHARACTER_CLASS
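// As a rough illustration of the macro above: CHARACTER_CLASS(Digit, ...)
// expands to approximately
//
//   class Digit {
//    public:
//     static inline bool InClass(char c) { return '0' <= c && c <= '9'; }
//   };
//
// which is what allows these classes to be passed as template arguments to
// methods like ConsumeZeroOrMore<Digit>() below.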
147 
148 // Given a char, interpret it as a numeric digit and return its value.
149 // This supports any number base up to 36.
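// For example, DigitValue('7') == 7, DigitValue('f') == 15,
// DigitValue('Z') == 35, and DigitValue('@') == -1.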
150 inline int DigitValue(char digit) {
151   if ('0' <= digit && digit <= '9') return digit - '0';
152   if ('a' <= digit && digit <= 'z') return digit - 'a' + 10;
153   if ('A' <= digit && digit <= 'Z') return digit - 'A' + 10;
154   return -1;
155 }
156 
157 // Inline because it's only used in one place.
158 inline char TranslateEscape(char c) {
159   switch (c) {
160     case 'a':
161       return '\a';
162     case 'b':
163       return '\b';
164     case 'f':
165       return '\f';
166     case 'n':
167       return '\n';
168     case 'r':
169       return '\r';
170     case 't':
171       return '\t';
172     case 'v':
173       return '\v';
174     case '\\':
175       return '\\';
176     case '?':
177       return '\?';  // Trigraphs = :(
178     case '\'':
179       return '\'';
180     case '"':
181       return '\"';
182 
183     // We expect escape sequences to have been validated separately.
184     default:
185       return '?';
186   }
187 }
188 
189 }  // anonymous namespace
190 
191 ErrorCollector::~ErrorCollector() {}
192 
193 // ===================================================================
194 
195 Tokenizer::Tokenizer(ZeroCopyInputStream* input,
196                      ErrorCollector* error_collector)
197     : input_(input),
198       error_collector_(error_collector),
199       buffer_(NULL),
200       buffer_size_(0),
201       buffer_pos_(0),
202       read_error_(false),
203       line_(0),
204       column_(0),
205       record_target_(NULL),
206       record_start_(-1),
207       allow_f_after_float_(false),
208       comment_style_(CPP_COMMENT_STYLE),
209       require_space_after_number_(true),
210       allow_multiline_strings_(false) {
211   current_.line = 0;
212   current_.column = 0;
213   current_.end_column = 0;
214   current_.type = TYPE_START;
215 
216   Refresh();
217 }
218 
219 Tokenizer::~Tokenizer() {
220   // If we had any buffer left unread, return it to the underlying stream
221   // so that someone else can read it.
222   if (buffer_size_ > buffer_pos_) {
223     input_->BackUp(buffer_size_ - buffer_pos_);
224   }
225 }
226 
227 bool Tokenizer::report_whitespace() const { return report_whitespace_; }
228 // Note: `set_report_whitespace(false)` implies `set_report_newlines(false)`.
229 void Tokenizer::set_report_whitespace(bool report) {
230   report_whitespace_ = report;
231   report_newlines_ &= report;
232 }
233 
234 // If true, newline tokens are reported by Next().
235 bool Tokenizer::report_newlines() const { return report_newlines_; }
236 // Note: `set_report_newlines(true)` implies `set_report_whitespace(true)`.
237 void Tokenizer::set_report_newlines(bool report) {
238   report_newlines_ = report;
239   report_whitespace_ |= report;  // enable report_whitespace if necessary
240 }
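// Taken together, the two setters above maintain the invariant that
// report_newlines_ implies report_whitespace_, so the only reachable states
// are (whitespace, newlines) == (false, false), (true, false), (true, true).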
241 
242 // -------------------------------------------------------------------
243 // Internal helpers.
244 
245 void Tokenizer::NextChar() {
246   // Update our line and column counters based on the character being
247   // consumed.
248   if (current_char_ == '\n') {
249     ++line_;
250     column_ = 0;
251   } else if (current_char_ == '\t') {
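    // For example, assuming kTabWidth is 8, a tab seen at column 3 advances
    // the column to 8, since 3 + (8 - 3 % 8) == 8.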
252     column_ += kTabWidth - column_ % kTabWidth;
253   } else {
254     ++column_;
255   }
256 
257   // Advance to the next character.
258   ++buffer_pos_;
259   if (buffer_pos_ < buffer_size_) {
260     current_char_ = buffer_[buffer_pos_];
261   } else {
262     Refresh();
263   }
264 }
265 
266 void Tokenizer::Refresh() {
267   if (read_error_) {
268     current_char_ = '\0';
269     return;
270   }
271 
272   // If we're in a token, append the rest of the buffer to it.
273   if (record_target_ != NULL && record_start_ < buffer_size_) {
274     record_target_->append(buffer_ + record_start_,
275                            buffer_size_ - record_start_);
276     record_start_ = 0;
277   }
278 
279   const void* data = NULL;
280   buffer_ = NULL;
281   buffer_pos_ = 0;
282   do {
283     if (!input_->Next(&data, &buffer_size_)) {
284       // end of stream (or read error)
285       buffer_size_ = 0;
286       read_error_ = true;
287       current_char_ = '\0';
288       return;
289     }
290   } while (buffer_size_ == 0);
291 
292   buffer_ = static_cast<const char*>(data);
293 
294   current_char_ = buffer_[0];
295 }
296 
297 inline void Tokenizer::RecordTo(std::string* target) {
298   record_target_ = target;
299   record_start_ = buffer_pos_;
300 }
301 
302 inline void Tokenizer::StopRecording() {
303   // Note:  The if() is necessary because some STL implementations crash when
304   //   you call string::append(NULL, 0), presumably because they are trying to
305   //   be helpful by detecting the NULL pointer, even though there's nothing
306   //   wrong with reading zero bytes from NULL.
307   if (buffer_pos_ != record_start_) {
308     record_target_->append(buffer_ + record_start_,
309                            buffer_pos_ - record_start_);
310   }
311   record_target_ = NULL;
312   record_start_ = -1;
313 }
314 
315 inline void Tokenizer::StartToken() {
316   current_.type = TYPE_START;  // Just for the sake of initializing it.
317   current_.text.clear();
318   current_.line = line_;
319   current_.column = column_;
320   RecordTo(&current_.text);
321 }
322 
323 inline void Tokenizer::EndToken() {
324   StopRecording();
325   current_.end_column = column_;
326 }
327 
328 // -------------------------------------------------------------------
329 // Helper methods that consume characters.
330 
331 template <typename CharacterClass>
332 inline bool Tokenizer::LookingAt() {
333   return CharacterClass::InClass(current_char_);
334 }
335 
336 template <typename CharacterClass>
337 inline bool Tokenizer::TryConsumeOne() {
338   if (CharacterClass::InClass(current_char_)) {
339     NextChar();
340     return true;
341   } else {
342     return false;
343   }
344 }
345 
346 inline bool Tokenizer::TryConsume(char c) {
347   if (current_char_ == c) {
348     NextChar();
349     return true;
350   } else {
351     return false;
352   }
353 }
354 
355 template <typename CharacterClass>
356 inline void Tokenizer::ConsumeZeroOrMore() {
357   while (CharacterClass::InClass(current_char_)) {
358     NextChar();
359   }
360 }
361 
362 template <typename CharacterClass>
363 inline void Tokenizer::ConsumeOneOrMore(const char* error) {
364   if (!CharacterClass::InClass(current_char_)) {
365     AddError(error);
366   } else {
367     do {
368       NextChar();
369     } while (CharacterClass::InClass(current_char_));
370   }
371 }
372 
373 // -------------------------------------------------------------------
374 // Methods that read whole patterns matching certain kinds of tokens
375 // or comments.
376 
377 void Tokenizer::ConsumeString(char delimiter) {
378   while (true) {
379     switch (current_char_) {
380       case '\0':
381         AddError("Unexpected end of string.");
382         return;
383 
384       case '\n': {
385         if (!allow_multiline_strings_) {
386           AddError("String literals cannot cross line boundaries.");
387           return;
388         }
389         NextChar();
390         break;
391       }
392 
393       case '\\': {
394         // An escape sequence.
395         NextChar();
396         if (TryConsumeOne<Escape>()) {
397           // Valid escape sequence.
398         } else if (TryConsumeOne<OctalDigit>()) {
399           // Possibly followed by two more octal digits, but these will
400           // just be consumed by the main loop anyway so we don't need
401           // to do so explicitly here.
402         } else if (TryConsume('x')) {
403           if (!TryConsumeOne<HexDigit>()) {
404             AddError("Expected hex digits for escape sequence.");
405           }
406           // Possibly followed by another hex digit, but again we don't care.
407         } else if (TryConsume('u')) {
408           if (!TryConsumeOne<HexDigit>() || !TryConsumeOne<HexDigit>() ||
409               !TryConsumeOne<HexDigit>() || !TryConsumeOne<HexDigit>()) {
410             AddError("Expected four hex digits for \\u escape sequence.");
411           }
412         } else if (TryConsume('U')) {
413           // We expect 8 hex digits; but only the range up to 0x10ffff is
414           // legal.
415           if (!TryConsume('0') || !TryConsume('0') ||
416               !(TryConsume('0') || TryConsume('1')) ||
417               !TryConsumeOne<HexDigit>() || !TryConsumeOne<HexDigit>() ||
418               !TryConsumeOne<HexDigit>() || !TryConsumeOne<HexDigit>() ||
419               !TryConsumeOne<HexDigit>()) {
420             AddError(
421                 "Expected eight hex digits up to 10ffff for \\U escape "
422                 "sequence");
423           }
424         } else {
425           AddError("Invalid escape sequence in string literal.");
426         }
427         break;
428       }
429 
430       default: {
431         if (current_char_ == delimiter) {
432           NextChar();
433           return;
434         }
435         NextChar();
436         break;
437       }
438     }
439   }
440 }
441 
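// A rough sketch of what ConsumeNumber() below accepts (with default options):
//   "0x1f"   -> TYPE_INTEGER (hex, reached via started_with_zero)
//   "0755"   -> TYPE_INTEGER (octal, leading zero)
//   "1.5e-3" -> TYPE_FLOAT
//   ".25"    -> TYPE_FLOAT   (reached via started_with_dot)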
442 Tokenizer::TokenType Tokenizer::ConsumeNumber(bool started_with_zero,
443                                               bool started_with_dot) {
444   bool is_float = false;
445 
446   if (started_with_zero && (TryConsume('x') || TryConsume('X'))) {
447     // A hex number (started with "0x").
448     ConsumeOneOrMore<HexDigit>("\"0x\" must be followed by hex digits.");
449 
450   } else if (started_with_zero && LookingAt<Digit>()) {
451     // An octal number (had a leading zero).
452     ConsumeZeroOrMore<OctalDigit>();
453     if (LookingAt<Digit>()) {
454       AddError("Numbers starting with leading zero must be in octal.");
455       ConsumeZeroOrMore<Digit>();
456     }
457 
458   } else {
459     // A decimal number.
460     if (started_with_dot) {
461       is_float = true;
462       ConsumeZeroOrMore<Digit>();
463     } else {
464       ConsumeZeroOrMore<Digit>();
465 
466       if (TryConsume('.')) {
467         is_float = true;
468         ConsumeZeroOrMore<Digit>();
469       }
470     }
471 
472     if (TryConsume('e') || TryConsume('E')) {
473       is_float = true;
474       TryConsume('-') || TryConsume('+');
475       ConsumeOneOrMore<Digit>("\"e\" must be followed by exponent.");
476     }
477 
478     if (allow_f_after_float_ && (TryConsume('f') || TryConsume('F'))) {
479       is_float = true;
480     }
481   }
482 
483   if (LookingAt<Letter>() && require_space_after_number_) {
484     AddError("Need space between number and identifier.");
485   } else if (current_char_ == '.') {
486     if (is_float) {
487       AddError(
488           "Already saw decimal point or exponent; can't have another one.");
489     } else {
490       AddError("Hex and octal numbers must be integers.");
491     }
492   }
493 
494   return is_float ? TYPE_FLOAT : TYPE_INTEGER;
495 }
496 
497 void Tokenizer::ConsumeLineComment(std::string* content) {
498   if (content != NULL) RecordTo(content);
499 
500   while (current_char_ != '\0' && current_char_ != '\n') {
501     NextChar();
502   }
503   TryConsume('\n');
504 
505   if (content != NULL) StopRecording();
506 }
507 
508 void Tokenizer::ConsumeBlockComment(std::string* content) {
509   int start_line = line_;
510   int start_column = column_ - 2;
511 
512   if (content != NULL) RecordTo(content);
513 
514   while (true) {
515     while (current_char_ != '\0' && current_char_ != '*' &&
516            current_char_ != '/' && current_char_ != '\n') {
517       NextChar();
518     }
519 
520     if (TryConsume('\n')) {
521       if (content != NULL) StopRecording();
522 
523       // Consume leading whitespace and asterisk;
524       ConsumeZeroOrMore<WhitespaceNoNewline>();
525       if (TryConsume('*')) {
526         if (TryConsume('/')) {
527           // End of comment.
528           break;
529         }
530       }
531 
532       if (content != NULL) RecordTo(content);
533     } else if (TryConsume('*') && TryConsume('/')) {
534       // End of comment.
535       if (content != NULL) {
536         StopRecording();
537         // Strip trailing "*/".
538         content->erase(content->size() - 2);
539       }
540       break;
541     } else if (TryConsume('/') && current_char_ == '*') {
542       // Note:  We didn't consume the '*' because if there is a '/' after it
543       //   we want to interpret that as the end of the comment.
544       AddError(
545           "\"/*\" inside block comment.  Block comments cannot be nested.");
546     } else if (current_char_ == '\0') {
547       AddError("End-of-file inside block comment.");
548       error_collector_->AddError(start_line, start_column,
549                                  "  Comment started here.");
550       if (content != NULL) StopRecording();
551       break;
552     }
553   }
554 }
555 
556 Tokenizer::NextCommentStatus Tokenizer::TryConsumeCommentStart() {
557   if (comment_style_ == CPP_COMMENT_STYLE && TryConsume('/')) {
558     if (TryConsume('/')) {
559       return LINE_COMMENT;
560     } else if (TryConsume('*')) {
561       return BLOCK_COMMENT;
562     } else {
563       // Oops, it was just a slash.  Return it.
564       current_.type = TYPE_SYMBOL;
565       current_.text = "/";
566       current_.line = line_;
567       current_.column = column_ - 1;
568       current_.end_column = column_;
569       return SLASH_NOT_COMMENT;
570     }
571   } else if (comment_style_ == SH_COMMENT_STYLE && TryConsume('#')) {
572     return LINE_COMMENT;
573   } else {
574     return NO_COMMENT;
575   }
576 }
577 
578 bool Tokenizer::TryConsumeWhitespace() {
579   if (report_newlines_) {
580     if (TryConsumeOne<WhitespaceNoNewline>()) {
581       ConsumeZeroOrMore<WhitespaceNoNewline>();
582       current_.type = TYPE_WHITESPACE;
583       return true;
584     }
585     return false;
586   }
587   if (TryConsumeOne<Whitespace>()) {
588     ConsumeZeroOrMore<Whitespace>();
589     current_.type = TYPE_WHITESPACE;
590     return report_whitespace_;
591   }
592   return false;
593 }
594 
595 bool Tokenizer::TryConsumeNewline() {
596   if (!report_whitespace_ || !report_newlines_) {
597     return false;
598   }
599   if (TryConsume('\n')) {
600     current_.type = TYPE_NEWLINE;
601     return true;
602   }
603   return false;
604 }
605 
606 // -------------------------------------------------------------------
607 
608 bool Tokenizer::Next() {
609   previous_ = current_;
610 
611   while (!read_error_) {
612     StartToken();
613     bool report_token = TryConsumeWhitespace() || TryConsumeNewline();
614     EndToken();
615     if (report_token) {
616       return true;
617     }
618 
619     switch (TryConsumeCommentStart()) {
620       case LINE_COMMENT:
621         ConsumeLineComment(NULL);
622         continue;
623       case BLOCK_COMMENT:
624         ConsumeBlockComment(NULL);
625         continue;
626       case SLASH_NOT_COMMENT:
627         return true;
628       case NO_COMMENT:
629         break;
630     }
631 
632     // Check for EOF before continuing.
633     if (read_error_) break;
634 
635     if (LookingAt<Unprintable>() || current_char_ == '\0') {
636       AddError("Invalid control characters encountered in text.");
637       NextChar();
638       // Skip more unprintable characters, too.  But, remember that '\0' is
639       // also what current_char_ is set to after EOF / read error.  We have
640       // to be careful not to go into an infinite loop of trying to consume
641       // it, so make sure to check read_error_ explicitly before consuming
642       // '\0'.
643       while (TryConsumeOne<Unprintable>() ||
644              (!read_error_ && TryConsume('\0'))) {
645         // Ignore.
646       }
647 
648     } else {
649       // Reading some sort of token.
650       StartToken();
651 
652       if (TryConsumeOne<Letter>()) {
653         ConsumeZeroOrMore<Alphanumeric>();
654         current_.type = TYPE_IDENTIFIER;
655       } else if (TryConsume('0')) {
656         current_.type = ConsumeNumber(true, false);
657       } else if (TryConsume('.')) {
658         // This could be the beginning of a floating-point number, or it could
659         // just be a '.' symbol.
660 
661         if (TryConsumeOne<Digit>()) {
662           // It's a floating-point number.
663           if (previous_.type == TYPE_IDENTIFIER &&
664               current_.line == previous_.line &&
665               current_.column == previous_.end_column) {
666             // We don't accept syntax like "blah.123".
667             error_collector_->AddError(
668                 line_, column_ - 2,
669                 "Need space between identifier and decimal point.");
670           }
671           current_.type = ConsumeNumber(false, true);
672         } else {
673           current_.type = TYPE_SYMBOL;
674         }
675       } else if (TryConsumeOne<Digit>()) {
676         current_.type = ConsumeNumber(false, false);
677       } else if (TryConsume('\"')) {
678         ConsumeString('\"');
679         current_.type = TYPE_STRING;
680       } else if (TryConsume('\'')) {
681         ConsumeString('\'');
682         current_.type = TYPE_STRING;
683       } else {
684         // Check if the high order bit is set.
685         if (current_char_ & 0x80) {
686           error_collector_->AddError(
687               line_, column_,
688               StringPrintf("Interpreting non ascii codepoint %d.",
689                               static_cast<unsigned char>(current_char_)));
690         }
691         NextChar();
692         current_.type = TYPE_SYMBOL;
693       }
694 
695       EndToken();
696       return true;
697     }
698   }
699 
700   // EOF
701   current_.type = TYPE_END;
702   current_.text.clear();
703   current_.line = line_;
704   current_.column = column_;
705   current_.end_column = column_;
706   return false;
707 }
708 
709 namespace {
710 
711 // Helper class for collecting comments and putting them in the right places.
712 //
713 // This basically just buffers the most recent comment until it can be decided
714 // exactly where that comment should be placed.  When Flush() is called, the
715 // current comment goes into either prev_trailing_comments or detached_comments.
716 // When the CommentCollector is destroyed, the last buffered comment goes into
717 // next_leading_comments.
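// For example, in a .proto file tokenized with NextWithComments(), comments
// would typically be attributed like this:
//
//   optional int32 foo = 1;  // Trailing comment, attached to "foo".
//
//   // Detached comment (separated from both neighbors by blank lines).
//
//   // Leading comment, attached to "bar".
//   optional int32 bar = 2;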
718 class CommentCollector {
719  public:
720   CommentCollector(std::string* prev_trailing_comments,
721                    std::vector<std::string>* detached_comments,
722                    std::string* next_leading_comments)
723       : prev_trailing_comments_(prev_trailing_comments),
724         detached_comments_(detached_comments),
725         next_leading_comments_(next_leading_comments),
726         has_comment_(false),
727         is_line_comment_(false),
728         can_attach_to_prev_(true) {
729     if (prev_trailing_comments != NULL) prev_trailing_comments->clear();
730     if (detached_comments != NULL) detached_comments->clear();
731     if (next_leading_comments != NULL) next_leading_comments->clear();
732   }
733 
734   ~CommentCollector() {
735     // Whatever is in the buffer is a leading comment.
736     if (next_leading_comments_ != NULL && has_comment_) {
737       comment_buffer_.swap(*next_leading_comments_);
738     }
739   }
740 
741   // About to read a line comment.  Get the comment buffer pointer in order to
742   // read into it.
743   std::string* GetBufferForLineComment() {
744     // We want to combine with previous line comments, but not block comments.
745     if (has_comment_ && !is_line_comment_) {
746       Flush();
747     }
748     has_comment_ = true;
749     is_line_comment_ = true;
750     return &comment_buffer_;
751   }
752 
753   // About to read a block comment.  Get the comment buffer pointer in order to
754   // read into it.
755   std::string* GetBufferForBlockComment() {
756     if (has_comment_) {
757       Flush();
758     }
759     has_comment_ = true;
760     is_line_comment_ = false;
761     return &comment_buffer_;
762   }
763 
764   void ClearBuffer() {
765     comment_buffer_.clear();
766     has_comment_ = false;
767   }
768 
769   // Called once we know that the comment buffer is complete and is *not*
770   // connected to the next token.
771   void Flush() {
772     if (has_comment_) {
773       if (can_attach_to_prev_) {
774         if (prev_trailing_comments_ != NULL) {
775           prev_trailing_comments_->append(comment_buffer_);
776         }
777         can_attach_to_prev_ = false;
778       } else {
779         if (detached_comments_ != NULL) {
780           detached_comments_->push_back(comment_buffer_);
781         }
782       }
783       ClearBuffer();
784     }
785   }
786 
787   void DetachFromPrev() { can_attach_to_prev_ = false; }
788 
789  private:
790   std::string* prev_trailing_comments_;
791   std::vector<std::string>* detached_comments_;
792   std::string* next_leading_comments_;
793 
794   std::string comment_buffer_;
795 
796   // True if any comments were read into comment_buffer_.  This can be true even
797   // if comment_buffer_ is empty, namely if the comment was "/**/".
798   bool has_comment_;
799 
800   // Is the comment in the comment buffer a line comment?
801   bool is_line_comment_;
802 
803   // Is it still possible that we could be reading a comment attached to the
804   // previous token?
805   bool can_attach_to_prev_;
806 };
807 
808 }  // namespace
809 
810 bool Tokenizer::NextWithComments(std::string* prev_trailing_comments,
811                                  std::vector<std::string>* detached_comments,
812                                  std::string* next_leading_comments) {
813   CommentCollector collector(prev_trailing_comments, detached_comments,
814                              next_leading_comments);
815 
816   if (current_.type == TYPE_START) {
817     // Ignore the Unicode byte order mark (BOM) if it appears at the
818     // beginning of the file. Only the UTF-8 BOM (0xEF 0xBB 0xBF) is accepted.
819     if (TryConsume(static_cast<char>(0xEF))) {
820       if (!TryConsume(static_cast<char>(0xBB)) ||
821           !TryConsume(static_cast<char>(0xBF))) {
822         AddError(
823             "Proto file starts with 0xEF but not UTF-8 BOM. "
824             "Only UTF-8 is accepted for proto file.");
825         return false;
826       }
827     }
828     collector.DetachFromPrev();
829   } else {
830     // A comment appearing on the same line must be attached to the previous
831     // declaration.
832     ConsumeZeroOrMore<WhitespaceNoNewline>();
833     switch (TryConsumeCommentStart()) {
834       case LINE_COMMENT:
835         ConsumeLineComment(collector.GetBufferForLineComment());
836 
837         // Don't allow comments on subsequent lines to be attached to a trailing
838         // comment.
839         collector.Flush();
840         break;
841       case BLOCK_COMMENT:
842         ConsumeBlockComment(collector.GetBufferForBlockComment());
843 
844         ConsumeZeroOrMore<WhitespaceNoNewline>();
845         if (!TryConsume('\n')) {
846           // Oops, the next token is on the same line.  If we recorded a comment
847           // we really have no idea which token it should be attached to.
848           collector.ClearBuffer();
849           return Next();
850         }
851 
852         // Don't allow comments on subsequent lines to be attached to a trailing
853         // comment.
854         collector.Flush();
855         break;
856       case SLASH_NOT_COMMENT:
857         return true;
858       case NO_COMMENT:
859         if (!TryConsume('\n')) {
860           // The next token is on the same line.  There are no comments.
861           return Next();
862         }
863         break;
864     }
865   }
866 
867   // OK, we are now on the line *after* the previous token.
868   while (true) {
869     ConsumeZeroOrMore<WhitespaceNoNewline>();
870 
871     switch (TryConsumeCommentStart()) {
872       case LINE_COMMENT:
873         ConsumeLineComment(collector.GetBufferForLineComment());
874         break;
875       case BLOCK_COMMENT:
876         ConsumeBlockComment(collector.GetBufferForBlockComment());
877 
878         // Consume the rest of the line so that we don't interpret it as a
879         // blank line the next time around the loop.
880         ConsumeZeroOrMore<WhitespaceNoNewline>();
881         TryConsume('\n');
882         break;
883       case SLASH_NOT_COMMENT:
884         return true;
885       case NO_COMMENT:
886         if (TryConsume('\n')) {
887           // Completely blank line.
888           collector.Flush();
889           collector.DetachFromPrev();
890         } else {
891           bool result = Next();
892           if (!result || current_.text == "}" || current_.text == "]" ||
893               current_.text == ")") {
894             // It looks like we're at the end of a scope.  In this case it
895             // makes no sense to attach a comment to the following token.
896             collector.Flush();
897           }
898           return result;
899         }
900         break;
901     }
902   }
903 }
904 
905 // -------------------------------------------------------------------
906 // Token-parsing helpers.  Remember that these don't need to report
907 // errors since any errors should already have been reported while
908 // tokenizing.  Also, these can assume that whatever text they
909 // are given is text that the tokenizer actually parsed as a token
910 // of the given type.
911 
912 bool Tokenizer::ParseInteger(const std::string& text, uint64_t max_value,
913                              uint64_t* output) {
914   // Sadly, we can't just use strtoul() since it is only 32-bit and strtoull()
915   // is non-standard.  I hate the C standard library.  :(
916 
917   //  return strtoull(text.c_str(), NULL, 0);
918 
919   const char* ptr = text.c_str();
920   int base = 10;
921   if (ptr[0] == '0') {
922     if (ptr[1] == 'x' || ptr[1] == 'X') {
923       // This is hex.
924       base = 16;
925       ptr += 2;
926     } else {
927       // This is octal.
928       base = 8;
929     }
930   }
931 
932   uint64_t result = 0;
933   for (; *ptr != '\0'; ptr++) {
934     int digit = DigitValue(*ptr);
935     if (digit < 0 || digit >= base) {
936       // The token provided by the Tokenizer is invalid, e.g. 099 is an
937       // invalid token, but the Tokenizer still thinks it is an integer.
938       return false;
939     }
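    // Overflow guard: result * base + digit <= max_value is equivalent to
    // result <= (max_value - digit) / base under integer division, so the
    // check below never has to compute a product that could wrap around.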
940     if (static_cast<uint64_t>(digit) > max_value ||
941         result > (max_value - digit) / base) {
942       // Overflow.
943       return false;
944     }
945     result = result * base + digit;
946   }
947 
948   *output = result;
949   return true;
950 }
951 
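// For example, ParseFloat("1.5e-3") returns 0.0015.  Note that malformed but
// previously tokenized text such as "1e" is still accepted here; the error
// was already reported during tokenization.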
952 double Tokenizer::ParseFloat(const std::string& text) {
953   const char* start = text.c_str();
954   char* end;
955   double result = NoLocaleStrtod(start, &end);
956 
957   // "1e" is not a valid float, but if the tokenizer reads it, it will
958   // report an error but still return it as a valid token.  We need to
959   // accept anything the tokenizer could possibly return, error or not.
960   if (*end == 'e' || *end == 'E') {
961     ++end;
962     if (*end == '-' || *end == '+') ++end;
963   }
964 
965   // If the Tokenizer had allow_f_after_float_ enabled, the float may be
966   // suffixed with the letter 'f'.
967   if (*end == 'f' || *end == 'F') {
968     ++end;
969   }
970 
971   GOOGLE_LOG_IF(DFATAL,
972          static_cast<size_t>(end - start) != text.size() || *start == '-')
973       << " Tokenizer::ParseFloat() passed text that could not have been"
974          " tokenized as a float: "
975       << CEscape(text);
976   return result;
977 }
978 
979 // Helper to append a Unicode code point to a string as UTF8, without bringing
980 // in any external dependencies.
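// For example, code_point 0x00e9 (LATIN SMALL LETTER E WITH ACUTE) takes the
// two-byte branch below: tmp becomes 0x0000c3a9, and the final two bytes
// 0xc3 0xa9 are appended, which is the UTF-8 encoding of U+00E9.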
981 static void AppendUTF8(uint32_t code_point, std::string* output) {
982   uint32_t tmp = 0;
983   int len = 0;
984   if (code_point <= 0x7f) {
985     tmp = code_point;
986     len = 1;
987   } else if (code_point <= 0x07ff) {
988     tmp = 0x0000c080 | ((code_point & 0x07c0) << 2) | (code_point & 0x003f);
989     len = 2;
990   } else if (code_point <= 0xffff) {
991     tmp = 0x00e08080 | ((code_point & 0xf000) << 4) |
992           ((code_point & 0x0fc0) << 2) | (code_point & 0x003f);
993     len = 3;
994   } else if (code_point <= 0x10ffff) {
995     tmp = 0xf0808080 | ((code_point & 0x1c0000) << 6) |
996           ((code_point & 0x03f000) << 4) | ((code_point & 0x000fc0) << 2) |
997           (code_point & 0x003f);
998     len = 4;
999   } else {
1000     // Unicode code points end at 0x10FFFF, so this is out-of-range.
1001     // ConsumeString permits hex values up to 0x1FFFFF, and FetchUnicodePoint
1002     // doesn't perform a range check.
1003     StringAppendF(output, "\\U%08x", code_point);
1004     return;
1005   }
1006   tmp = ghtonl(tmp);
1007   output->append(reinterpret_cast<const char*>(&tmp) + sizeof(tmp) - len, len);
1008 }
1009 
1010 // Try to read <len> hex digits from ptr, and stuff the numeric result into
1011 // *result. Returns true if that many digits were successfully consumed.
1012 static bool ReadHexDigits(const char* ptr, int len, uint32_t* result) {
1013   *result = 0;
1014   if (len == 0) return false;
1015   for (const char* end = ptr + len; ptr < end; ++ptr) {
1016     if (*ptr == '\0') return false;
1017     *result = (*result << 4) + DigitValue(*ptr);
1018   }
1019   return true;
1020 }
1021 
1022 // Handling UTF-16 surrogate pairs. UTF-16 encodes code points in the range
1023 // 0x10000...0x10ffff as a pair of numbers, a head surrogate followed by a trail
1024 // surrogate. These numbers are in a reserved range of Unicode code points, so
1025 // if we encounter such a pair we know how to parse it and convert it into a
1026 // single code point.
1027 static const uint32_t kMinHeadSurrogate = 0xd800;
1028 static const uint32_t kMaxHeadSurrogate = 0xdc00;
1029 static const uint32_t kMinTrailSurrogate = 0xdc00;
1030 static const uint32_t kMaxTrailSurrogate = 0xe000;
1031 
1032 static inline bool IsHeadSurrogate(uint32_t code_point) {
1033   return (code_point >= kMinHeadSurrogate) && (code_point < kMaxHeadSurrogate);
1034 }
1035 
1036 static inline bool IsTrailSurrogate(uint32_t code_point) {
1037   return (code_point >= kMinTrailSurrogate) &&
1038          (code_point < kMaxTrailSurrogate);
1039 }
1040 
1041 // Combine a head and trail surrogate into a single Unicode code point.
1042 static uint32_t AssembleUTF16(uint32_t head_surrogate,
1043                               uint32_t trail_surrogate) {
1044   GOOGLE_DCHECK(IsHeadSurrogate(head_surrogate));
1045   GOOGLE_DCHECK(IsTrailSurrogate(trail_surrogate));
1046   return 0x10000 + (((head_surrogate - kMinHeadSurrogate) << 10) |
1047                     (trail_surrogate - kMinTrailSurrogate));
1048 }
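// For example, the escape pair "\ud83d\ude00" (U+1F600) decodes as
// AssembleUTF16(0xd83d, 0xde00)
//     == 0x10000 + ((0x003d << 10) | 0x0200) == 0x1f600.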
1049 
1050 // Convert the escape sequence parameter to a number of expected hex digits.
1051 static inline int UnicodeLength(char key) {
1052   if (key == 'u') return 4;
1053   if (key == 'U') return 8;
1054   return 0;
1055 }
1056 
1057 // Given a pointer to the 'u' or 'U' starting a Unicode escape sequence, attempt
1058 // to parse that sequence. On success, returns a pointer to the first char
1059 // beyond that sequence, and fills in *code_point. On failure, returns ptr
1060 // itself.
1061 static const char* FetchUnicodePoint(const char* ptr, uint32_t* code_point) {
1062   const char* p = ptr;
1063   // Fetch the code point.
1064   const int len = UnicodeLength(*p++);
1065   if (!ReadHexDigits(p, len, code_point)) return ptr;
1066   p += len;
1067 
1068   // Check if the code point we read is a "head surrogate." If so, then we
1069   // expect it to be immediately followed by another code point which is a valid
1070   // "trail surrogate," and together they form a UTF-16 pair which decodes into
1071   // a single Unicode point. Trail surrogates may only use \u, not \U.
1072   if (IsHeadSurrogate(*code_point) && *p == '\\' && *(p + 1) == 'u') {
1073     uint32_t trail_surrogate;
1074     if (ReadHexDigits(p + 2, 4, &trail_surrogate) &&
1075         IsTrailSurrogate(trail_surrogate)) {
1076       *code_point = AssembleUTF16(*code_point, trail_surrogate);
1077       p += 6;
1078     }
1079     // If this failed, then we just emit the head surrogate as a code point.
1080     // It's bogus, but so is the string.
1081   }
1082 
1083   return p;
1084 }
1085 
1086 // The text string must begin and end with single or double quote
1087 // characters.
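// For example, if the token text is "a\n\x41" (with the quotes being part of
// the text itself), the characters 'a', '\n', and 'A' are appended to
// *output.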
1088 void Tokenizer::ParseStringAppend(const std::string& text,
1089                                   std::string* output) {
1090   // Reminder: text[0] is always a quote character.  (If text is
1091   // empty, it's invalid, so we'll just return).
1092   const size_t text_size = text.size();
1093   if (text_size == 0) {
1094     GOOGLE_LOG(DFATAL) << " Tokenizer::ParseStringAppend() passed text that could not"
1095                    " have been tokenized as a string: "
1096                 << CEscape(text);
1097     return;
1098   }
1099 
1100   // Reserve room for new string. The branch is necessary because if
1101   // there is already space available the reserve() call might
1102   // downsize the output.
1103   const size_t new_len = text_size + output->size();
1104   if (new_len > output->capacity()) {
1105     output->reserve(new_len);
1106   }
1107 
1108   // Loop through the string copying characters to "output" and
1109   // interpreting escape sequences.  Note that any invalid escape
1110   // sequences or other errors were already reported while tokenizing.
1111   // In this case we do not need to produce valid results.
1112   for (const char* ptr = text.c_str() + 1; *ptr != '\0'; ptr++) {
1113     if (*ptr == '\\' && ptr[1] != '\0') {
1114       // An escape sequence.
1115       ++ptr;
1116 
1117       if (OctalDigit::InClass(*ptr)) {
1118         // An octal escape.  May be one, two, or three digits.
1119         int code = DigitValue(*ptr);
1120         if (OctalDigit::InClass(ptr[1])) {
1121           ++ptr;
1122           code = code * 8 + DigitValue(*ptr);
1123         }
1124         if (OctalDigit::InClass(ptr[1])) {
1125           ++ptr;
1126           code = code * 8 + DigitValue(*ptr);
1127         }
1128         output->push_back(static_cast<char>(code));
1129 
1130       } else if (*ptr == 'x') {
1131         // A hex escape.  May be zero, one, or two digits.  (The zero case
1132         // will have been caught as an error earlier.)
1133         int code = 0;
1134         if (HexDigit::InClass(ptr[1])) {
1135           ++ptr;
1136           code = DigitValue(*ptr);
1137         }
1138         if (HexDigit::InClass(ptr[1])) {
1139           ++ptr;
1140           code = code * 16 + DigitValue(*ptr);
1141         }
1142         output->push_back(static_cast<char>(code));
1143 
1144       } else if (*ptr == 'u' || *ptr == 'U') {
1145         uint32_t unicode;
1146         const char* end = FetchUnicodePoint(ptr, &unicode);
1147         if (end == ptr) {
1148           // Failure: Just dump out what we saw, don't try to parse it.
1149           output->push_back(*ptr);
1150         } else {
1151           AppendUTF8(unicode, output);
1152           ptr = end - 1;  // Because we're about to ++ptr.
1153         }
1154       } else {
1155         // Some other escape code.
1156         output->push_back(TranslateEscape(*ptr));
1157       }
1158 
1159     } else if (*ptr == text[0] && ptr[1] == '\0') {
1160       // Ignore final quote matching the starting quote.
1161     } else {
1162       output->push_back(*ptr);
1163     }
1164   }
1165 }
1166 
1167 template <typename CharacterClass>
1168 static bool AllInClass(const std::string& s) {
1169   for (const char character : s) {
1170     if (!CharacterClass::InClass(character)) return false;
1171   }
1172   return true;
1173 }
1174 
1175 bool Tokenizer::IsIdentifier(const std::string& text) {
1176   // Mirrors IDENTIFIER definition in Tokenizer::Next() above.
1177   if (text.size() == 0) return false;
1178   if (!Letter::InClass(text.at(0))) return false;
1179   if (!AllInClass<Alphanumeric>(text.substr(1))) return false;
1180   return true;
1181 }
1182 
1183 }  // namespace io
1184 }  // namespace protobuf
1185 }  // namespace google
1186