#!/usr/bin/env python
# Copyright 2014 The Chromium Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE.chromium file.

"""
A Deterministic acyclic finite state automaton (DAFSA) is a compact
representation of an unordered word list (dictionary).

https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton

This python program converts a list of strings to a byte array in C++.
It fetches strings and return values from a public suffix list file and
generates a C++ file with a byte array representing the graph, which can be
used as a memory-efficient replacement for the perfect hash table.

The input strings must consist of printable 7-bit ASCII characters or UTF-8
multibyte sequences. Control characters in the range [0x00-0x1F] are not
allowed. The return values must be one-digit integers.

In this program a DAFSA is a diamond-shaped graph starting at a common
source node and ending at a common sink node. All internal nodes contain
a label and each word is represented by the labels in one path from
the source node to the sink node.

The following python representation is used for nodes:

  Source node: [ children ]
  Internal node: (label, [ children ])
  Sink node: None

The graph is first compressed by prefixes like a trie. In the next step
suffixes are compressed so that the graph gets diamond-shaped. Finally
one-to-one linked nodes are replaced by nodes with the labels joined.

The order of the operations is crucial, since lookups will be performed
starting from the source with no backtracking. Thus a node must have at
most one child with a label starting with the same character. The output
is also arranged so that all jumps are to increasing addresses, i.e. forward
in memory.

The generated output has suffix-free decoding, so the leading bits of a link
(a reference to a child node) indicate whether it is one, two or three bytes
long and whether it is the last outgoing link from the actual node.
A node label is terminated by a byte with the leading bit set.

The generated byte array can be described by the following BNF:

<byte> ::= < 8-bit value in range [0x00-0xFF] >

<char> ::= < byte in range [0x1F-0x7F] >
<end_char> ::= < char + 0x80, byte in range [0x9F-0xFF] >
<return_value> ::= < value + 0x80, byte in range [0x80-0x8F] >

<offset1> ::= < byte in range [0x00-0x3F] >
<offset2> ::= < byte in range [0x40-0x5F] >
<offset3> ::= < byte in range [0x60-0x7F] >

<end_offset1> ::= < byte in range [0x80-0xBF] >
<end_offset2> ::= < byte in range [0xC0-0xDF] >
<end_offset3> ::= < byte in range [0xE0-0xFF] >

<prefix> ::= <char>

<label> ::= <end_char>
          | <char> <label>

<end_label> ::= <return_value>
              | <char> <end_label>

<offset> ::= <offset1>
           | <offset2> <byte>
           | <offset3> <byte> <byte>

<end_offset> ::= <end_offset1>
               | <end_offset2> <byte>
               | <end_offset3> <byte> <byte>

<offsets> ::= <end_offset>
            | <offset> <offsets>

<source> ::= <offsets>

<node> ::= <label> <offsets>
         | <prefix> <node>
         | <end_label>

<graph> ::= <node>
          | <graph> <node>

<version> ::= <empty>            # The DAFSA was generated in ASCII mode.
            | < byte value 0x01 >  # The DAFSA was generated in UTF-8 mode.

<dafsa> ::= <graph> <version>

Decoding:

<char> -> character
<end_char> & 0x7F -> character
<return_value> & 0x0F -> integer
<offset1> & 0x3F -> integer
((<offset2> & 0x1F) << 8) + <byte> -> integer
((<offset3> & 0x1F) << 16) + (<byte> << 8) + <byte> -> integer

<end_offset1>, <end_offset2> and <end_offset3> are decoded the same way as
<offset1>, <offset2> and <offset3> respectively.
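
For example, an <offset2> byte 0x41 followed by 0x02 decodes to
((0x41 & 0x1F) << 8) + 0x02 = 0x102.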

The first offset in a list of offsets is the distance in bytes between the
offset itself and the first child node. Subsequent offsets are the distance
between the previous child node and the next child node. Thus each offset
links a node to a child node. The distance is always counted between start
addresses, i.e. the first byte in the decoded offset or the first byte in
the child node.

Transcoding of UTF-8 multibyte sequences:

The original DAFSA format was limited to 7-bit printable ASCII characters in
the range [0x20-0x7F], but has been extended to allow UTF-8 multibyte
sequences. By transcoding such characters the new format preserves
compatibility with old parsers, so that a DAFSA in the extended format can be
used by an old parser without false positives, although strings containing
transcoded characters will never match. Since the format is extended rather
than changed, a parser supporting the new format will automatically support
data generated in the old format.

Transcoding is performed by inserting a start byte with the special value
0x1F, followed by 2-4 bytes shifted into the range [0x40-0x7F], thus inside
the range of printable ASCII.

2-byte: 110nnnnn, 10nnnnnn -> 00011111, 010nnnnn, 01nnnnnn

3-byte: 1110nnnn, 10nnnnnn, 10nnnnnn -> 00011111, 0110nnnn, 01nnnnnn, 01nnnnnn

4-byte: 11110nnn, 10nnnnnn, 10nnnnnn, 10nnnnnn ->
                00011111, 01110nnn, 01nnnnnn, 01nnnnnn, 01nnnnnn
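
For example, U+00E9 ('e' with acute accent) is the UTF-8 sequence 0xC3 0xA9;
applying the 2-byte rule above, it is stored transcoded as 0x1F 0x43 0x69.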

Example 1:

%%
aa, 1
a, 2
%%

The input is first parsed to a list of words:
["aa1", "a2"]

A fully expanded graph is created from the words:
source = [node1, node4]
node1 = ("a", [node2])
node2 = ("a", [node3])
node3 = ("\x01", [sink])
node4 = ("a", [node5])
node5 = ("\x02", [sink])
sink = None

Compression results in the following graph:
source = [node1]
node1 = ("a", [node2, node3])
node2 = ("\x02", [sink])
node3 = ("a\x01", [sink])
sink = None

A C++ representation of the compressed graph is generated:

const unsigned char dafsa[7] = {
  0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81,
};

The bytes in the generated array have the following meaning:

 0: 0x81 <end_offset1>  child at position 0 + (0x81 & 0x3F) -> jump to 1

 1: 0xE1 <end_char>     label character (0xE1 & 0x7F) -> match "a"
 2: 0x02 <offset1>      child at position 2 + (0x02 & 0x3F) -> jump to 4

 3: 0x81 <end_offset1>  child at position 4 + (0x81 & 0x3F) -> jump to 5
 4: 0x82 <return_value> 0x82 & 0x0F -> return 2

 5: 0x61 <char>         label character 0x61 -> match "a"
 6: 0x81 <return_value> 0x81 & 0x0F -> return 1

Example 2:

%%
aa, 1
bbb, 2
baa, 1
%%

The input is first parsed to a list of words:
["aa1", "bbb2", "baa1"]

Compression results in the following graph:
source = [node1, node2]
node1 = ("b", [node2, node3])
node2 = ("aa\x01", [sink])
node3 = ("bb\x02", [sink])
sink = None

A C++ representation of the compressed graph is generated:

const unsigned char dafsa[11] = {
  0x02, 0x83, 0xE2, 0x02, 0x83, 0x61, 0x61, 0x81, 0x62, 0x62, 0x82,
};

The bytes in the generated array have the following meaning:

 0: 0x02 <offset1>      child at position 0 + (0x02 & 0x3F) -> jump to 2
 1: 0x83 <end_offset1>  child at position 2 + (0x83 & 0x3F) -> jump to 5

 2: 0xE2 <end_char>     label character (0xE2 & 0x7F) -> match "b"
 3: 0x02 <offset1>      child at position 3 + (0x02 & 0x3F) -> jump to 5
 4: 0x83 <end_offset1>  child at position 5 + (0x83 & 0x3F) -> jump to 8

 5: 0x61 <char>         label character 0x61 -> match "a"
 6: 0x61 <char>         label character 0x61 -> match "a"
 7: 0x81 <return_value> 0x81 & 0x0F -> return 1

 8: 0x62 <char>         label character 0x62 -> match "b"
 9: 0x62 <char>         label character 0x62 -> match "b"
10: 0x82 <return_value> 0x82 & 0x0F -> return 2
"""

import sys
import os.path
import hashlib

class InputError(Exception):
  """Exception raised for errors in the input file."""

# Length of a character starting at a given byte.
char_length_table = ( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x00-0x0F
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x10-0x1F
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x20-0x2F
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x30-0x3F
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x40-0x4F
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x50-0x5F
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x60-0x6F
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 0x70-0x7F
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x80-0x8F
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0x90-0x9F
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0xA0-0xAF
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # 0xB0-0xBF
                      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  # 0xC0-0xCF
                      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  # 0xD0-0xDF
                      3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,  # 0xE0-0xEF
                      4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0 ) # 0xF0-0xFF
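# For example, char_length_table[0x61] == 1 (printable ASCII 'a') and
# char_length_table[0xC3] == 2 (leading byte of a 2-byte UTF-8 sequence).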

def to_bytes(n):
  """Converts an integer value to a bytes object."""
  return bytes(bytearray((n,)))

def to_dafsa(words, utf_mode):
  """Generates a DAFSA from a word list and returns the source node.

  Each word is split into characters so that each character is represented by
  a unique node. It is assumed the word list is not empty.
  """
  if not words:
    raise InputError('The domain list must not be empty')
  def to_nodes(word, multibyte_length):
    """Split words into characters"""
    byte = ord(word[:1])
    if multibyte_length:
      # Consume next byte in multibyte sequence.
      if byte & 0xC0 != 0x80:
        raise InputError('Invalid UTF-8 multibyte sequence')
      return to_bytes(byte ^ 0xC0), [to_nodes(word[1:], multibyte_length - 1)]
    char_length = char_length_table[byte]
    if char_length == 1:
      # 7-bit printable ASCII.
      if len(word) == 1:
        return to_bytes(int(word[:1], 16) & 0x0F), [None]
      return word[:1], [to_nodes(word[1:], 0)]
    elif char_length > 1:
      # Leading byte in multibyte sequence.
      if not utf_mode:
        raise InputError('UTF-8 encoded characters are not allowed in ASCII mode')
      if len(word) <= char_length:
        raise InputError('Unterminated UTF-8 multibyte sequence')
      return to_bytes(0x1F), [(to_bytes(byte ^ 0x80), [to_nodes(word[1:], char_length - 1)])]
    # Unexpected character.
    raise InputError('Domain names must be printable ASCII or UTF-8')

  return [to_nodes(word, 0) for word in words]

def to_words(node):
  """Generates a word list from all paths starting from an internal node."""
  if not node:
    return [b'']
  return [(node[0] + word) for child in node[1] for word in to_words(child)]


def reverse(dafsa):
  """Generates a new DAFSA that is reversed, so that the old sink node becomes
  the new source node.
  """
  sink = []
  nodemap = {}

  def dfs(node, parent):
    """Creates reverse nodes.

    A new reverse node will be created for each old node. The new node will
    get a reversed label and the parents of the old node as children.
    """
    if not node:
      sink.append(parent)
    elif id(node) not in nodemap:
      nodemap[id(node)] = (node[0][::-1], [parent])
      for child in node[1]:
        dfs(child, nodemap[id(node)])
    else:
      nodemap[id(node)][1].append(parent)

  for node in dafsa:
    dfs(node, None)
  return sink


def join_labels(dafsa):
  """Generates a new DAFSA where internal nodes are merged if there is a
  one-to-one connection.
  """
  parentcount = {id(None): 2}
  nodemap = {id(None): None}

  def count_parents(node):
    """Count incoming references"""
    if id(node) in parentcount:
      parentcount[id(node)] += 1
    else:
      parentcount[id(node)] = 1
      for child in node[1]:
        count_parents(child)

  def join(node):
    """Create new nodes"""
    if id(node) not in nodemap:
      children = [join(child) for child in node[1]]
      if len(children) == 1 and parentcount[id(node[1][0])] == 1:
        child = children[0]
        nodemap[id(node)] = (node[0] + child[0], child[1])
      else:
        nodemap[id(node)] = (node[0], children)
    return nodemap[id(node)]

  for node in dafsa:
    count_parents(node)
  return [join(node) for node in dafsa]


def join_suffixes(dafsa):
  """Generates a new DAFSA where nodes that represent the same word lists
  towards the sink are merged.
  """
  nodemap = {frozenset((b'',)): None}

  def join(node):
    """Returns a matching node. A new node is created if no matching node
    exists. The graph is accessed in dfs order.
    """
    suffixes = frozenset(to_words(node))
    if suffixes not in nodemap:
      nodemap[suffixes] = (node[0], [join(child) for child in node[1]])
    return nodemap[suffixes]

  return [join(node) for node in dafsa]


def top_sort(dafsa):
  """Generates a list of nodes in topological sort order."""
  incoming = {}

  def count_incoming(node):
    """Counts incoming references."""
    if node:
      if id(node) not in incoming:
        incoming[id(node)] = 1
        for child in node[1]:
          count_incoming(child)
      else:
        incoming[id(node)] += 1

  for node in dafsa:
    count_incoming(node)

  for node in dafsa:
    incoming[id(node)] -= 1

  waiting = [node for node in dafsa if incoming[id(node)] == 0]
  nodes = []

  while waiting:
    node = waiting.pop()
    assert incoming[id(node)] == 0
    nodes.append(node)
    for child in node[1]:
      if child:
        incoming[id(child)] -= 1
        if incoming[id(child)] == 0:
          waiting.append(child)
  return nodes


def encode_links(children, offsets, current):
  """Encodes a list of children as one, two or three byte offsets."""
  if not children[0]:
    # This is an <end_label> node and no links follow such nodes
    assert len(children) == 1
    return []
  guess = 3 * len(children)
  assert children
  children = sorted(children, key=lambda x: -offsets[id(x)])
  while True:
    offset = current + guess
    buf = []
    for child in children:
      last = len(buf)
      distance = offset - offsets[id(child)]
      assert distance > 0 and distance < (1 << 21)

      if distance < (1 << 6):
        # A 6-bit offset: "s0xxxxxx"
        buf.append(distance)
      elif distance < (1 << 13):
        # A 13-bit offset: "s10xxxxxxxxxxxxx"
        buf.append(0x40 | (distance >> 8))
        buf.append(distance & 0xFF)
      else:
        # A 21-bit offset: "s11xxxxxxxxxxxxxxxxxxxxx"
        buf.append(0x60 | (distance >> 16))
        buf.append((distance >> 8) & 0xFF)
        buf.append(distance & 0xFF)
      # The distance in the first link is relative to the following record.
      # Distances in the other links are relative to the previous link.
      offset -= distance
    if len(buf) == guess:
      break
    guess = len(buf)
  # Set most significant bit to mark end of links in this node.
  buf[last] |= (1 << 7)
  buf.reverse()
  return buf


def encode_prefix(label):
  """Encodes a node label as a list of bytes without a trailing high byte.

  This method encodes a node if there is exactly one child and the
  child follows immediately after so that no jump is needed. This label
  will then be a prefix to the label in the child node.
  """
  assert label
  return [c for c in bytearray(reversed(label))]


def encode_label(label):
  """Encodes a node label as a list of bytes with a trailing high byte >= 0x80.
  """
  buf = encode_prefix(label)
  # Set most significant bit to mark end of label in this node.
  buf[0] |= (1 << 7)
  return buf


def encode(dafsa, utf_mode):
  """Encodes a DAFSA to a list of bytes"""
  output = []
  offsets = {}

  for node in reversed(top_sort(dafsa)):
    if (len(node[1]) == 1 and node[1][0] and
        (offsets[id(node[1][0])] == len(output))):
      output.extend(encode_prefix(node[0]))
    else:
      output.extend(encode_links(node[1], offsets, len(output)))
      output.extend(encode_label(node[0]))
    offsets[id(node)] = len(output)

  output.extend(encode_links(dafsa, offsets, len(output)))
  output.reverse()
  if utf_mode:
    output.append(0x01)
  return output


def to_cxx(data, codecs):
  """Generates C++ code from a list of encoded bytes."""
  text = b'/* This file has been generated by psl-make-dafsa. DO NOT EDIT!\n\n'
  text += b'The byte array encodes effective tld names. See psl-make-dafsa source for'
  text += b' documentation.'
  text += b'*/\n\n'
  text += b'static const unsigned char kDafsa['
  text += bytes(str(len(data)), **codecs)
  text += b'] = {\n'
  for i in range(0, len(data), 12):
    text += b'  '
    text += bytes(', '.join('0x%02x' % byte for byte in data[i:i + 12]), **codecs)
    text += b',\n'
  text += b'};\n'
  return text

def sha1_file(name):
  sha1 = hashlib.sha1()
  with open(name, 'rb') as f:
    while True:
      data = f.read(65536)
      if not data:
        break
      sha1.update(data)
  return sha1.hexdigest()

def to_cxx_plus(data, codecs):
  """Generates C++ code from a list of encoded bytes plus some variable assignments as needed by libpsl"""
  text = to_cxx(data, codecs)
  text += b'static time_t _psl_file_time = %d;\n' % os.stat(psl_input_file).st_mtime
  text += b'static int _psl_nsuffixes = %d;\n' % psl_nsuffixes
  text += b'static int _psl_nexceptions = %d;\n' % psl_nexceptions
  text += b'static int _psl_nwildcards = %d;\n' % psl_nwildcards
  text += b'static const char _psl_sha1_checksum[] = "%s";\n' % bytes(sha1_file(psl_input_file), **codecs)
  text += b'static const char _psl_filename[] = "%s";\n' % bytes(psl_input_file, **codecs)
  return text

def words_to_whatever(words, converter, utf_mode, codecs):
  """Generates output from a word list using the given converter"""
  dafsa = to_dafsa(words, utf_mode)
  for fun in (reverse, join_suffixes, reverse, join_suffixes, join_labels):
    dafsa = fun(dafsa)
  return converter(encode(dafsa, utf_mode), codecs)


def words_to_cxx(words, utf_mode, codecs):
  """Generates C++ code from a word list"""
  return words_to_whatever(words, to_cxx, utf_mode, codecs)

def words_to_cxx_plus(words, utf_mode, codecs):
  """Generates C++ code from a word list plus some variable assignments as needed by libpsl"""
  return words_to_whatever(words, to_cxx_plus, utf_mode, codecs)

def words_to_binary(words, utf_mode, codecs):
  """Generates binary DAFSA data from a word list"""
  return b'.DAFSA@PSL_0   \n' + words_to_whatever(words, lambda x, _: bytearray(x), utf_mode, codecs)


def parse_psl(infile, utf_mode, codecs):
  """Parses a PSL file and extracts strings and return codes"""
  PSL_FLAG_EXCEPTION = (1<<0)
  PSL_FLAG_WILDCARD = (1<<1)
  PSL_FLAG_ICANN = (1<<2) # entry of ICANN section
  PSL_FLAG_PRIVATE = (1<<3) # entry of PRIVATE section
  PSL_FLAG_PLAIN = (1<<4) # just used for PSL syntax checking
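  # Illustrative: a plain rule in the ICANN section gets flags
  # PSL_FLAG_PLAIN | PSL_FLAG_ICANN (0x14); only the low four bits (0x4)
  # survive the "flags & 0x0F" masking when the word list is built below.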

  global psl_nsuffixes, psl_nexceptions, psl_nwildcards

  psl = {}
  section = 0

  for line in infile:
    line = bytes(line.strip(), **codecs)
    if not line:
      continue

    if line.startswith(b'//'):
      if section == 0:
        if b'===BEGIN ICANN DOMAINS===' in line:
          section = PSL_FLAG_ICANN
        elif b'===BEGIN PRIVATE DOMAINS===' in line:
          section = PSL_FLAG_PRIVATE
      elif section == PSL_FLAG_ICANN and b'===END ICANN DOMAINS===' in line:
        section = 0
      elif section == PSL_FLAG_PRIVATE and b'===END PRIVATE DOMAINS===' in line:
        section = 0
      continue # skip comments

    if line[:1] == b'!':
      psl_nexceptions += 1
      flags = PSL_FLAG_EXCEPTION | section
      line = line[1:]
    elif line[:1] == b'*':
      if line[1:2] != b'.':
        print('Unsupported kind of rule (ignored): %s' % line)
        continue
      psl_nwildcards += 1
      psl_nsuffixes += 1
      flags = PSL_FLAG_WILDCARD | PSL_FLAG_PLAIN | section
      line = line[2:]
    else:
      psl_nsuffixes += 1
      flags = PSL_FLAG_PLAIN | section

    punycode = line.decode('utf-8').encode('idna')

    if punycode in psl:
      """Found existing entry:
         Combination of exception and plain rule is ambiguous
           !foo.bar
            foo.bar

         Allowed:
          !foo.bar + *.foo.bar
           foo.bar + *.foo.bar
      """
      print('Found %s/%X (now %X)' % (punycode, psl[punycode], flags))
      continue

    if utf_mode:
      psl[line] = flags
    psl[punycode] = flags

#  with open("psl.out", 'w') as outfile:
#    for (domain, flags) in sorted(psl.iteritems()):
#      outfile.write(domain + "%X" % (flags & 0x0F) + "\n")

  return [domain + bytes('%X' % (flags & 0x0F), **codecs) for (domain, flags) in sorted(psl.items())]


def usage():
  """Prints the usage"""
  print('usage: %s [options] infile outfile' % sys.argv[0])
  print('  --output-format=cxx     Write DAFSA as C/C++ code (default)')
  print('  --output-format=cxx+    Write DAFSA as C/C++ code plus statistical assignments')
  print('  --output-format=binary  Write DAFSA binary data')
  print('  --encoding=ascii        7-bit ASCII mode')
  print('  --encoding=utf-8        UTF-8 mode (default)')
  sys.exit(1)


def main():
  """Convert PSL file into C or binary DAFSA file"""
  if len(sys.argv) < 3:
    usage()

  converter = words_to_cxx
  parser = parse_psl
  utf_mode = True

  codecs = dict()
  if sys.version_info.major > 2:
    codecs['encoding'] = 'utf-8'

  for arg in sys.argv[1:-2]:
    # Check --input-format for backward compatibility
    if arg.startswith('--input-format='):
      value = arg[15:].lower()
      if value == 'psl':
        parser = parse_psl
      else:
        print("Unknown input format '%s'" % value)
        return 1
    elif arg.startswith('--output-format='):
      value = arg[16:].lower()
      if value == 'binary':
        converter = words_to_binary
      elif value == 'cxx':
        converter = words_to_cxx
      elif value == 'cxx+':
        converter = words_to_cxx_plus
      else:
        print("Unknown output format '%s'" % value)
        return 1
    elif arg.startswith('--encoding='):
      value = arg[11:].lower()
      if value == 'ascii':
        utf_mode = False
      elif value == 'utf-8':
        utf_mode = True
      else:
        print("Unknown encoding '%s'" % value)
        return 1
    else:
      usage()

  if sys.argv[-2] == '-':
    with open(sys.argv[-1], 'wb') as outfile:
      outfile.write(converter(parser(sys.stdin, utf_mode, codecs), utf_mode, codecs))
  else:
    """Some statistical data for --output-format=cxx+"""
    global psl_input_file, psl_nsuffixes, psl_nexceptions, psl_nwildcards

    psl_input_file = sys.argv[-2]
    psl_nsuffixes = 0
    psl_nexceptions = 0
    psl_nwildcards = 0

    with open(sys.argv[-2], 'r', **codecs) as infile, open(sys.argv[-1], 'wb') as outfile:
      outfile.write(converter(parser(infile, utf_mode, codecs), utf_mode, codecs))

  return 0


if __name__ == '__main__':
  sys.exit(main())