#!/usr/bin/env python
# Copyright 2014 The Chromium Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE.chromium file.

"""
A Deterministic acyclic finite state automaton (DAFSA) is a compact
representation of an unordered word list (dictionary).

https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton

This python program converts a list of strings into a byte array in C++.
It fetches strings and return values from a Public Suffix List (PSL) file
and generates a C++ file with a byte array representing a graph that can be
used as a memory efficient replacement for a perfect hash table.

The input strings must consist of printable 7-bit ASCII characters or UTF-8
multibyte sequences. Control characters in the range [0x00-0x1F] are not
allowed. The return values must be one digit integers.

In this program a DAFSA is a diamond shaped graph starting at a common
source node and ending at a common sink node. All internal nodes contain
a label and each word is represented by the labels in one path from
the source node to the sink node.

The following python representation is used for nodes:

  Source node: [ children ]
  Internal node: (label, [ children ])
  Sink node: None

The graph is first compressed by prefixes like a trie. In the next step
suffixes are compressed so that the graph gets diamond shaped. Finally
one to one linked nodes are replaced by nodes with the labels joined.

The order of the operations is crucial since lookups will be performed
starting from the source with no backtracking. Thus a node must have at
most one child with a label starting with the same character. The output
is also arranged so that all jumps are to increasing addresses, thus forward
in memory.

The generated output has suffix free decoding: the leading bits of a link
(a reference to a child node) indicate whether it has a size of one, two or
three bytes and whether it is the last outgoing link of the current node.
A node label is terminated by a byte with the leading bit set.

The generated byte array can be described by the following BNF:

<byte> ::= < 8-bit value in range [0x00-0xFF] >

<char> ::= < byte in range [0x1F-0x7F] >
<end_char> ::= < char + 0x80, byte in range [0x9F-0xFF] >
<return_value> ::= < value + 0x80, byte in range [0x80-0x8F] >

<offset1> ::= < byte in range [0x00-0x3F] >
<offset2> ::= < byte in range [0x40-0x5F] >
<offset3> ::= < byte in range [0x60-0x7F] >

<end_offset1> ::= < byte in range [0x80-0xBF] >
<end_offset2> ::= < byte in range [0xC0-0xDF] >
<end_offset3> ::= < byte in range [0xE0-0xFF] >

<prefix> ::= <char>

<label> ::= <end_char>
          | <char> <label>

<end_label> ::= <return_value>
              | <char> <end_label>

<offset> ::= <offset1>
           | <offset2> <byte>
           | <offset3> <byte> <byte>

<end_offset> ::= <end_offset1>
              | <end_offset2> <byte>
              | <end_offset3> <byte> <byte>

<offsets> ::= <end_offset>
            | <offset> <offsets>

<source> ::= <offsets>

<node> ::= <label> <offsets>
         | <prefix> <node>
         | <end_label>

<graph> ::= <source>
          | <graph> <node>

<version> ::= <empty>               # The DAFSA was generated in ASCII mode.
            | < byte value 0x01 >   # The DAFSA was generated in UTF-8 mode.

<dafsa> ::= <graph> <version>
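
In ASCII mode the last byte of the array is always the <return_value> byte of
some word (range [0x80-0x8F]), so one way for a consumer to detect which mode
a DAFSA was generated in is to test the trailing byte, for example
(illustrative only, "data" being the generated byte array):

  utf_mode = len(data) > 0 and data[-1] == 0x01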
Decoding:

<char> -> character
<end_char> & 0x7F -> character
<return_value> & 0x0F -> integer
<offset1> & 0x3F -> integer
((<offset2> & 0x1F) << 8) + <byte> -> integer
((<offset3> & 0x1F) << 16) + (<byte> << 8) + <byte> -> integer

end_offset1, end_offset2 and end_offset3 are decoded the same way as offset1,
offset2 and offset3 respectively.

The first offset in a list of offsets is the distance in bytes between the
offset itself and the first child node. Subsequent offsets are the distance
between the previous child node and the next child node. Thus each offset
links a node to a child node. The distance is always counted between start
addresses, i.e. the first byte of the decoded offset or the first byte of the
child node.

Transcoding of UTF-8 multibyte sequences:

The original DAFSA format was limited to 7-bit printable ASCII characters in
the range [0x20-0x7F], but has been extended to allow UTF-8 multibyte
sequences. By transcoding such characters the new format preserves
compatibility with old parsers, so that a DAFSA in the extended format can be
used by an old parser without false positives, although strings containing
transcoded characters will never match. Since the format is extended rather
than changed, a parser supporting the new format will automatically support
data generated in the old format.

Transcoding is performed by inserting a start byte with the special value
0x1F, followed by 2-4 bytes shifted into the range [0x40-0x7F], thus inside
the range of printable ASCII.

2-byte: 110nnnnn, 10nnnnnn -> 00011111, 010nnnnn, 01nnnnnn

3-byte: 1110nnnn, 10nnnnnn, 10nnnnnn -> 00011111, 0110nnnn, 01nnnnnn, 01nnnnnn

4-byte: 11110nnn, 10nnnnnn, 10nnnnnn, 10nnnnnn ->
        00011111, 01110nnn, 01nnnnnn, 01nnnnnn, 01nnnnnn
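
For example, U+00FC (LATIN SMALL LETTER U WITH DIAERESIS), encoded in UTF-8
as the 2-byte sequence 0xC3, 0xBC, is transcoded to 0x1F, 0x43, 0x7C:

2-byte: 11000011, 10111100 -> 00011111, 01000011, 01111100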

Example 1:

%%
aa, 1
a, 2
%%

The input is first parsed to a list of words:
["aa1", "a2"]

A fully expanded graph is created from the words:
source = [node1, node4]
node1 = ("a", [node2])
node2 = ("a", [node3])
node3 = ("\x01", [sink])
node4 = ("a", [node5])
node5 = ("\x02", [sink])
sink = None

Compression results in the following graph:
source = [node1]
node1 = ("a", [node2, node3])
node2 = ("\x02", [sink])
node3 = ("a\x01", [sink])
sink = None

A C++ representation of the compressed graph is generated:

const unsigned char dafsa[7] = {
  0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81,
};

The bytes in the generated array have the following meaning:

 0: 0x81 <end_offset1>  child at position 0 + (0x81 & 0x3F) -> jump to 1

 1: 0xE1 <end_char>     label character (0xE1 & 0x7F) -> match "a"
 2: 0x02 <offset1>      child at position 2 + (0x02 & 0x3F) -> jump to 4

 3: 0x81 <end_offset1>  child at position 4 + (0x81 & 0x3F) -> jump to 5
 4: 0x82 <return_value> 0x82 & 0x0F -> return 2

 5: 0x61 <char>         label character 0x61 -> match "a"
 6: 0x81 <return_value> 0x81 & 0x0F -> return 1

Example 2:

%%
aa, 1
bbb, 2
baa, 1
%%

The input is first parsed to a list of words:
["aa1", "bbb2", "baa1"]

Compression results in the following graph:
source = [node1, node2]
node1 = ("b", [node2, node3])
node2 = ("aa\x01", [sink])
node3 = ("bb\x02", [sink])
sink = None

A C++ representation of the compressed graph is generated:

const unsigned char dafsa[11] = {
  0x02, 0x83, 0xE2, 0x02, 0x83, 0x61, 0x61, 0x81, 0x62, 0x62, 0x82,
};

The bytes in the generated array have the following meaning:

 0: 0x02 <offset1>      child at position 0 + (0x02 & 0x3F) -> jump to 2
 1: 0x83 <end_offset1>  child at position 2 + (0x83 & 0x3F) -> jump to 5

 2: 0xE2 <end_char>     label character (0xE2 & 0x7F) -> match "b"
 3: 0x02 <offset1>      child at position 3 + (0x02 & 0x3F) -> jump to 5
 4: 0x83 <end_offset1>  child at position 5 + (0x83 & 0x3F) -> jump to 8

 5: 0x61 <char>         label character 0x61 -> match "a"
 6: 0x61 <char>         label character 0x61 -> match "a"
 7: 0x81 <return_value> 0x81 & 0x0F -> return 1

 8: 0x62 <char>         label character 0x62 -> match "b"
 9: 0x62 <char>         label character 0x62 -> match "b"
10: 0x82 <return_value> 0x82 & 0x0F -> return 2
"""

import sys
import os.path
import hashlib

class InputError(Exception):
  """Exception raised for errors in the input file."""

# Length of the character starting at a given byte value.
char_length_table = ( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x00-0x0F
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x10-0x1F
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x20-0x2F
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x30-0x3F
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x40-0x4F
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x50-0x5F
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x60-0x6F
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x70-0x7F
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x80-0x8F
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x90-0x9F
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0xA0-0xAF
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0xB0-0xBF
                      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  # 0xC0-0xCF
                      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  # 0xD0-0xDF
                      3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,  # 0xE0-0xEF
                      4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0 ) # 0xF0-0xFF

def to_bytes(n):
  """Converts an integer value to a bytes object."""
  return bytes(bytearray((n,)))

def to_dafsa(words, utf_mode):
  """Generates a DAFSA from a word list and returns the source node.

  Each word is split into characters so that each character is represented by
  a unique node. It is assumed the word list is not empty.
  """
  if not words:
    raise InputError('The domain list must not be empty')
  def to_nodes(word, multibyte_length):
    """Splits a word into a chain of character nodes."""
    byte = ord(word[:1])
    if multibyte_length:
      # Consume the next byte of a multibyte sequence.
      if byte & 0xC0 != 0x80:
        raise InputError('Invalid UTF-8 multibyte sequence')
      return to_bytes(byte ^ 0xC0), [to_nodes(word[1:], multibyte_length - 1)]
    char_length = char_length_table[byte]
    if char_length == 1:
      # 7-bit printable ASCII.
      if len(word) == 1:
        return to_bytes(int(word[:1], 16) & 0x0F), [None]
      return word[:1], [to_nodes(word[1:], 0)]
    elif char_length > 1:
      # Leading byte of a multibyte sequence.
      if not utf_mode:
        raise InputError('UTF-8 encoded characters are not allowed in ASCII mode')
      if len(word) <= char_length:
        raise InputError('Unterminated UTF-8 multibyte sequence')
      return to_bytes(0x1F), [(to_bytes(byte ^ 0x80), [to_nodes(word[1:], char_length - 1)])]
    # Unexpected character.
    raise InputError('Domain names must be printable ASCII or UTF-8')

  return [to_nodes(word, 0) for word in words]
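
# Illustrative note, not used by the tool: for the Example 1 word list in the
# module docstring, to_dafsa([b'aa1', b'a2'], False) builds the fully expanded
# graph shown there:
#
#   [(b'a', [(b'a', [(b'\x01', [None])])]),
#    (b'a', [(b'\x02', [None])])]
#
# The trailing digit of each word has become a one byte return value and None
# is the shared sink node.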
def to_words(node):
  """Generates a word list from all paths starting from an internal node."""
  if not node:
    return [b'']
  return [(node[0] + word) for child in node[1] for word in to_words(child)]


def reverse(dafsa):
  """Generates a new DAFSA that is reversed, so that the old sink node becomes
  the new source node.
  """
  sink = []
  nodemap = {}

  def dfs(node, parent):
    """Creates reverse nodes.

    A new reverse node will be created for each old node. The new node will
    get a reversed label and the parents of the old node as children.
    """
    if not node:
      sink.append(parent)
    elif id(node) not in nodemap:
      nodemap[id(node)] = (node[0][::-1], [parent])
      for child in node[1]:
        dfs(child, nodemap[id(node)])
    else:
      nodemap[id(node)][1].append(parent)

  for node in dafsa:
    dfs(node, None)
  return sink


def join_labels(dafsa):
  """Generates a new DAFSA where internal nodes are merged if there is a one to
  one connection.
  """
  parentcount = {id(None): 2}
  nodemap = {id(None): None}

  def count_parents(node):
    """Counts incoming references."""
    if id(node) in parentcount:
      parentcount[id(node)] += 1
    else:
      parentcount[id(node)] = 1
      for child in node[1]:
        count_parents(child)

  def join(node):
    """Creates new nodes."""
    if id(node) not in nodemap:
      children = [join(child) for child in node[1]]
      if len(children) == 1 and parentcount[id(node[1][0])] == 1:
        child = children[0]
        nodemap[id(node)] = (node[0] + child[0], child[1])
      else:
        nodemap[id(node)] = (node[0], children)
    return nodemap[id(node)]

  for node in dafsa:
    count_parents(node)
  return [join(node) for node in dafsa]


def join_suffixes(dafsa):
  """Generates a new DAFSA where nodes that represent the same word lists
  towards the sink are merged.
  """
  nodemap = {frozenset((b'',)): None}

  def join(node):
    """Returns a matching node. A new node is created if no matching node
    exists. The graph is accessed in dfs order.
    """
    suffixes = frozenset(to_words(node))
    if suffixes not in nodemap:
      nodemap[suffixes] = (node[0], [join(child) for child in node[1]])
    return nodemap[suffixes]

  return [join(node) for node in dafsa]
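
# Illustrative note, not used by the tool: words_to_whatever() below applies
# the passes above in the order
#
#   reverse, join_suffixes, reverse, join_suffixes, join_labels
#
# i.e. the first reverse + join_suffixes pass performs the trie-like prefix
# compression described in the module docstring, the second pass (after
# reversing back) merges common suffixes, and join_labels finally collapses
# chains of one-to-one linked nodes into single nodes with joined labels.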
def top_sort(dafsa):
  """Generates a list of nodes in topological sort order."""
  incoming = {}

  def count_incoming(node):
    """Counts incoming references."""
    if node:
      if id(node) not in incoming:
        incoming[id(node)] = 1
        for child in node[1]:
          count_incoming(child)
      else:
        incoming[id(node)] += 1

  for node in dafsa:
    count_incoming(node)

  for node in dafsa:
    incoming[id(node)] -= 1

  waiting = [node for node in dafsa if incoming[id(node)] == 0]
  nodes = []

  while waiting:
    node = waiting.pop()
    assert incoming[id(node)] == 0
    nodes.append(node)
    for child in node[1]:
      if child:
        incoming[id(child)] -= 1
        if incoming[id(child)] == 0:
          waiting.append(child)
  return nodes


def encode_links(children, offsets, current):
  """Encodes a list of children as one, two or three byte offsets."""
  if not children[0]:
    # This is an <end_label> node and no links follow such nodes.
    assert len(children) == 1
    return []
  guess = 3 * len(children)
  assert children
  children = sorted(children, key=lambda x: -offsets[id(x)])
  while True:
    offset = current + guess
    buf = []
    for child in children:
      last = len(buf)
      distance = offset - offsets[id(child)]
      assert distance > 0 and distance < (1 << 21)

      if distance < (1 << 6):
        # A 6-bit offset: "s0xxxxxx"
        buf.append(distance)
      elif distance < (1 << 13):
        # A 13-bit offset: "s10xxxxxxxxxxxxx"
        buf.append(0x40 | (distance >> 8))
        buf.append(distance & 0xFF)
      else:
        # A 21-bit offset: "s11xxxxxxxxxxxxxxxxxxxxx"
        buf.append(0x60 | (distance >> 16))
        buf.append((distance >> 8) & 0xFF)
        buf.append(distance & 0xFF)
      # The distance in the first link is relative to the following record.
      # The distances in the other links are relative to the previous link.
      offset -= distance
    if len(buf) == guess:
      break
    guess = len(buf)
  # Set the most significant bit to mark the end of the links in this node.
  buf[last] |= (1 << 7)
  buf.reverse()
  return buf


def encode_prefix(label):
  """Encodes a node label as a list of bytes without a trailing high byte.

  This method encodes a node if there is exactly one child and the
  child follows immediately after so that no jump is needed. This label
  will then be a prefix to the label in the child node.
  """
  assert label
  return [c for c in bytearray(reversed(label))]


def encode_label(label):
  """Encodes a node label as a list of bytes with a trailing high byte >= 0x80.
  """
  buf = encode_prefix(label)
  # Set the most significant bit to mark the end of the label in this node.
  buf[0] |= (1 << 7)
  return buf
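
# Illustrative note, not used by the tool: with the encoding above a link
# distance of 5 fits in the one byte form (0x05), while a distance of 300
# needs the two byte form: 0x40 | (300 >> 8) = 0x41 followed by 300 & 0xFF =
# 0x2C. If that link is also the last one of its node, the high bit of its
# first byte is set, giving the <end_offset2> pair 0xC1 0x2C.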
def encode(dafsa, utf_mode):
  """Encodes a DAFSA to a list of bytes."""
  output = []
  offsets = {}

  for node in reversed(top_sort(dafsa)):
    if (len(node[1]) == 1 and node[1][0] and
        (offsets[id(node[1][0])] == len(output))):
      output.extend(encode_prefix(node[0]))
    else:
      output.extend(encode_links(node[1], offsets, len(output)))
      output.extend(encode_label(node[0]))
    offsets[id(node)] = len(output)

  output.extend(encode_links(dafsa, offsets, len(output)))
  output.reverse()
  if utf_mode:
    output.append(0x01)
  return output


def to_cxx(data, codecs):
  """Generates C++ code from a list of encoded bytes."""
  text = b'/* This file has been generated by psl-make-dafsa. DO NOT EDIT!\n\n'
  text += b'The byte array encodes effective tld names. See psl-make-dafsa source for'
  text += b' documentation.'
  text += b'*/\n\n'
  text += b'static const unsigned char kDafsa['
  text += bytes(str(len(data)), **codecs)
  text += b'] = {\n'
  for i in range(0, len(data), 12):
    text += b'  '
    text += bytes(', '.join('0x%02x' % byte for byte in data[i:i + 12]), **codecs)
    text += b',\n'
  text += b'};\n'
  return text

def sha1_file(name):
  """Returns the SHA-1 checksum of a file as a hex string."""
  sha1 = hashlib.sha1()
  with open(name, 'rb') as f:
    while True:
      data = f.read(65536)
      if not data:
        break
      sha1.update(data)
  return sha1.hexdigest()

def to_cxx_plus(data, codecs):
  """Generates C++ code from a word list plus some variable assignments as needed by libpsl"""
  text = to_cxx(data, codecs)
  text += b'static time_t _psl_file_time = %d;\n' % os.stat(psl_input_file).st_mtime
  text += b'static int _psl_nsuffixes = %d;\n' % psl_nsuffixes
  text += b'static int _psl_nexceptions = %d;\n' % psl_nexceptions
  text += b'static int _psl_nwildcards = %d;\n' % psl_nwildcards
  text += b'static const char _psl_sha1_checksum[] = "%s";\n' % bytes(sha1_file(psl_input_file), **codecs)
  text += b'static const char _psl_filename[] = "%s";\n' % bytes(psl_input_file, **codecs)
  return text

def words_to_whatever(words, converter, utf_mode, codecs):
  """Generates output from a word list using the given converter function."""
  dafsa = to_dafsa(words, utf_mode)
  for fun in (reverse, join_suffixes, reverse, join_suffixes, join_labels):
    dafsa = fun(dafsa)
  return converter(encode(dafsa, utf_mode), codecs)


def words_to_cxx(words, utf_mode, codecs):
  """Generates C++ code from a word list"""
  return words_to_whatever(words, to_cxx, utf_mode, codecs)

def words_to_cxx_plus(words, utf_mode, codecs):
  """Generates C++ code from a word list plus some variable assignments as needed by libpsl"""
  return words_to_whatever(words, to_cxx_plus, utf_mode, codecs)

def words_to_binary(words, utf_mode, codecs):
  """Generates binary DAFSA data from a word list"""
  return b'.DAFSA@PSL_0 \n' + words_to_whatever(words, lambda x, _: bytearray(x), utf_mode, codecs)
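
# Illustrative only, not used by this tool: a minimal sketch of how a consumer
# could look up a key in the byte array produced by encode(), following the
# decoding rules in the module docstring. The function name and structure are
# this sketch's own; real lookups happen in the C code that embeds the
# generated array. Keys containing non-ASCII characters would additionally
# have to be transcoded (0x1F marker plus shifted bytes) just like to_dafsa()
# does before they can match.
def example_lookup(data, key):
  """Returns the one digit value stored for key (a bytes object), or None."""
  data = bytearray(data)
  key = bytearray(key)

  def children(pos):
    """Yields absolute positions of the children of the offset list at pos."""
    target = pos
    while True:
      byte = data[pos]
      if byte & 0x60 == 0x60:    # <offset3>: three byte offset
        distance = ((byte & 0x1F) << 16) | (data[pos + 1] << 8) | data[pos + 2]
        pos += 3
      elif byte & 0x60 == 0x40:  # <offset2>: two byte offset
        distance = ((byte & 0x1F) << 8) | data[pos + 1]
        pos += 2
      else:                      # <offset1>: one byte offset
        distance = byte & 0x3F
        pos += 1
      target += distance         # first offset is relative to the list itself,
      yield target               # later offsets to the previous child
      if byte & 0x80:            # <end_offset>: last link of this node
        return

  siblings = children(0)         # the source node is a plain offset list
  matched = 0                    # number of key bytes matched so far
  while True:
    for pos in siblings:
      byte = data[pos]
      if 0x80 <= byte <= 0x8F:   # <return_value> without label characters
        if matched == len(key):
          return byte & 0x0F
        continue                 # the key is longer, try the next sibling
      if matched < len(key) and byte & 0x7F == key[matched]:
        break                    # this label starts with the next key byte
    else:
      return None                # no sibling continues the key
    matched += 1
    while not byte & 0x80:       # <char>: the label continues, no backtracking
      pos += 1
      byte = data[pos]
      if 0x80 <= byte <= 0x8F:   # <return_value>: a word ends here
        return byte & 0x0F if matched == len(key) else None
      if matched == len(key) or byte & 0x7F != key[matched]:
        return None
      matched += 1
    siblings = children(pos + 1) # <end_char> matched: its offset list follows

# For the arrays shown in the module docstring this sketch gives, e.g.,
# example_lookup([0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81], b'aa') == 1 and
# example_lookup([0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81], b'a') == 2.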
def parse_psl(infile, utf_mode, codecs):
  """Parses a PSL file and extracts the strings and return codes."""
  PSL_FLAG_EXCEPTION = (1<<0)
  PSL_FLAG_WILDCARD = (1<<1)
  PSL_FLAG_ICANN = (1<<2)    # entry of ICANN section
  PSL_FLAG_PRIVATE = (1<<3)  # entry of PRIVATE section
  PSL_FLAG_PLAIN = (1<<4)    # just used for PSL syntax checking

  global psl_nsuffixes, psl_nexceptions, psl_nwildcards

  psl = {}
  section = 0

  for line in infile:
    line = bytes(line.strip(), **codecs)
    if not line:
      continue

    if line.startswith(b'//'):
      if section == 0:
        if b'===BEGIN ICANN DOMAINS===' in line:
          section = PSL_FLAG_ICANN
        elif b'===BEGIN PRIVATE DOMAINS===' in line:
          section = PSL_FLAG_PRIVATE
      elif section == PSL_FLAG_ICANN and b'===END ICANN DOMAINS===' in line:
        section = 0
      elif section == PSL_FLAG_PRIVATE and b'===END PRIVATE DOMAINS===' in line:
        section = 0
      continue  # skip comments

    if line[:1] == b'!':
      psl_nexceptions += 1
      flags = PSL_FLAG_EXCEPTION | section
      line = line[1:]
    elif line[:1] == b'*':
      if line[1:2] != b'.':
        print('Unsupported kind of rule (ignored): %s' % line)
        continue
      psl_nwildcards += 1
      psl_nsuffixes += 1
      flags = PSL_FLAG_WILDCARD | PSL_FLAG_PLAIN | section
      line = line[2:]
    else:
      psl_nsuffixes += 1
      flags = PSL_FLAG_PLAIN | section

    punycode = line.decode('utf-8').encode('idna')

    if punycode in psl:
      """Found existing entry:
         Combination of exception and plain rule is ambiguous
           !foo.bar
           foo.bar

         Allowed:
           !foo.bar + *.foo.bar
           foo.bar + *.foo.bar
      """
      print('Found %s/%X (now %X)' % (punycode, psl[punycode], flags))
      continue

    if utf_mode:
      psl[line] = flags
    psl[punycode] = flags

#  with open("psl.out", 'w') as outfile:
#    for (domain, flags) in sorted(psl.iteritems()):
#      outfile.write(domain + "%X" % (flags & 0x0F) + "\n")

  return [domain + bytes('%X' % (flags & 0x0F), **codecs) for (domain, flags) in sorted(psl.items())]


def usage():
  """Prints the usage."""
  print('usage: %s [options] infile outfile' % sys.argv[0])
  print('  --output-format=cxx     Write DAFSA as C/C++ code (default)')
  print('  --output-format=cxx+    Write DAFSA as C/C++ code plus statistical assignments')
  print('  --output-format=binary  Write DAFSA binary data')
  print('  --encoding=ascii        7-bit ASCII mode')
  print('  --encoding=utf-8        UTF-8 mode (default)')
  exit(1)
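
# Illustrative invocations (the file names are only examples):
#
#   psl-make-dafsa --output-format=cxx+ public_suffix_list.dat psl_dafsa.c
#   psl-make-dafsa --encoding=ascii --output-format=binary public_suffix_list.dat psl.dafsa
#
# Passing '-' as infile makes main() read the PSL data from standard input.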
def main():
  """Converts a PSL file into a C or binary DAFSA file."""
  if len(sys.argv) < 3:
    usage()

  converter = words_to_cxx
  parser = parse_psl
  utf_mode = True

  codecs = dict()
  if sys.version_info.major > 2:
    codecs['encoding'] = 'utf-8'

  for arg in sys.argv[1:-2]:
    # Check --input-format for backward compatibility.
    if arg.startswith('--input-format='):
      value = arg[15:].lower()
      if value == 'psl':
        parser = parse_psl
      else:
        print("Unknown input format '%s'" % value)
        return 1
    elif arg.startswith('--output-format='):
      value = arg[16:].lower()
      if value == 'binary':
        converter = words_to_binary
      elif value == 'cxx':
        converter = words_to_cxx
      elif value == 'cxx+':
        converter = words_to_cxx_plus
      else:
        print("Unknown output format '%s'" % value)
        return 1
    elif arg.startswith('--encoding='):
      value = arg[11:].lower()
      if value == 'ascii':
        utf_mode = False
      elif value == 'utf-8':
        utf_mode = True
      else:
        print("Unknown encoding '%s'" % value)
        return 1
    else:
      usage()

  # Statistical data for --output-format=cxx+, filled in by parse_psl().
  # The counters are initialized here so that reading from stdin works too.
  global psl_input_file, psl_nsuffixes, psl_nexceptions, psl_nwildcards

  psl_nsuffixes = 0
  psl_nexceptions = 0
  psl_nwildcards = 0

  if sys.argv[-2] == '-':
    with open(sys.argv[-1], 'wb') as outfile:
      outfile.write(converter(parser(sys.stdin, utf_mode, codecs), utf_mode, codecs))
  else:
    psl_input_file = sys.argv[-2]

    with open(sys.argv[-2], 'r', **codecs) as infile, open(sys.argv[-1], 'wb') as outfile:
      outfile.write(converter(parser(infile, utf_mode, codecs), utf_mode, codecs))

  return 0


if __name__ == '__main__':
  sys.exit(main())