1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> 2<html lang="en"> 3 <head> 4 <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> 5 <title>MARISA: Matching Algorithm with Recursively Implemented StorAge</title> 6 <link rel="stylesheet" type="text/css" href="./style.css"> 7 </head> 8 <body> 9 <div id="header"> 10 <div class="left">MARISA: Matching Algorithm with Recursively Implemented StorAge</div> 11 <div class="right">Last modified: 14 Jun 2020</div> 12 <div class="end"></div> 13 </div><!-- header --> 14 <div id="body" style="text-align: justify"> 15 <h1>MARISA: Matching Algorithm with Recursively Implemented StorAge</h1> 16 <p id="abstract"> 17 <span id="heading">Abstract: </span> 18 Matching Algorithm with Recursively Implemented StorAge (MARISA) is a space-efficient trie data structure. libmarisa is a C++ library for an implementation of MARISA. Users can build dictionaries and search keys from the dictionaries. The package also provides command line tools to test basic operations of libmarisa, and the tools are useful to test the performance. 19 </p><!-- abstract --> 20 21 <div class="section"> 22 <h2><a name="introduction">Introduction</a></h2> 23 <div class="subsection"> 24 <h3><a name="overview">Overview</a></h3> 25 <p> 26 Matching Algorithm with Recursively Implemented StorAge (MARISA) is a space-efficient, fairly fast, and static trie data structure. MARISA serves as a dictionary structure, and by definition, it supports exact match lookup, which is the basic operation of dictionary. In addition, MARISA supports reverse lookup, common prefix search, and predictive search. 27 </p> 28 <p> 29 In most cases, MARISA is much more compact than a plain text which consists of the registered keys. This means that the traditional dictionary implementations, a binary tree (<code>std::map<std::string, T></code>) and a hash table (<code>std::unordered_map<std::string, T></code>), require more and more and more spaces than MARISA. Bloom Filter, a probabilistic data structure, is more space-efficient than MARISA but causes false positives and does not support reverse lookup, common prefix search, and predictive search. 30 </p> 31 <p> 32 libmarisa is a C++ library for an implementation of MARISA. Users can build dictionaries and search keys from the dictionaries. The package also provides command line tools to test basic operations of libmarisa, and the tools are useful to test the performance. 33 </p> 34 </div><!-- subsection --> 35 <div class="subsection"> 36 <h3><a name="ability">Functionality</a></h3> 37 <p> 38 libmarisa associates string keys with unique IDs, from <var>0</var> to <var>(n - 1)</var>, where <var>n</var> is the number of keys. Note that users cannot specify the IDs because the mapping is automatically generated by MARISA. Every search function takes a string or an ID and returns the search result which is represented by a pair of the key and its ID. 39 </p> 40 <ul> 41 <li>Lookup 42 <ul> 43 <li>checks whether or not a query string is registered.</li> 44 </ul> 45 </li> 46 <li>Reverse lookup 47 <ul> 48 <li>restores a key from its ID.</li> 49 </ul> 50 </li> 51 <li>Common prefix search 52 <ul> 53 <li>searches keys from the possible prefixes of a query string.</li> 54 </ul> 55 </li> 56 <li>Predictive search 57 <ul> 58 <li>searches keys starting with a query string.</li> 59 </ul> 60 </li> 61 </ul> 62 </div><!-- subsection --> 63 </div><!-- section --> 64 <div class="section"> 65 <h2><a name="source">Source</a></h2> 66 <div class="subsection"> 67 <h3><a name="license">License</a></h3> 68 <p> 69 libmarisa and its command line tools are dual-licensed under the BSD 2-clause license and the LGPL. 70 </p> 71 </div><!-- subsection --> 72 <div class="subsection"> 73 <h3><a name="download">Download</a></h3> 74 <p> 75 The project is hosted on <a href="https://github.com/">GitHub</a>. 76 </p> 77 <ul> 78 <li>Project 79 <ul> 80 <li><a href="https://github.com/s-yata/marisa-trie">https://github.com/s-yata/marisa-trie</a></li> 81 </ul> 82 </li> 83 <li>Source 84 <ul> 85 <li><a href="https://github.com/s-yata/marisa-trie/archive/v0.2.6.tar.gz">marisa-0.2.6.tar.gz</a></li> 86 </ul> 87 </li> 88 </ul> 89 </div><!-- subsection --> 90 </div><!-- section --> 91 92 <div class="section"> 93 <h2><a name="install">Installation</a></h2> 94 <div class="subsection"> 95 <h3><a name="gcc">GCC & Clang</a></h3> 96 <div class="float"> 97 <pre class="console">$ tar zxf marisa-0.2.6.tar.gz 98$ cd marisa-0.2.6 99$ ./configure 100$ make 101$ make check 102$ make install</pre> 103 </div><!-- float --> 104 <p> 105 Users can install libmarisa by using <kbd>configure</kbd> and <kbd>make</kbd>. <kbd>make install</kbd> might require <kbd>sudo</kbd> to install libmarisa as the root user. Additionally, <kbd>ldconfig</kbd> might be required because libmarisa is installed as a shared library in default settings. 106 </p> 107 <p> 108 If a POPCNT instruction is available on your environment, you can specify <kbd>--enable-popcnt</kbd>, when you run <kbd>configure</kbd>, to improve the performance of libmarisa. Likewise, <kbd>--enable-sse2</kbd>, <kbd>--enable-sse3</kbd>, <kbd>--enable-ssse3</kbd>, <kbd>--enable-sse4.1</kbd>, <kbd>--enable-sse4.2</kbd>, <kbd>--enable-sse4</kbd>, <kbd>--enable-sse4a</kbd>, <kbd>--enable-bmi</kbd>, <kbd>--enable-bmi2</kbd> are available. Note that <kbd>--enable-native-code</kbd> enables instructions available on the compilation environment. Also, if you need a static library, specify <kbd>--enable-static</kbd> to <kbd>configure</kbd>. For other options, see <kbd>./configure --help</kbd>. 109 </p> 110 </div><!-- subsection --> 111 <div class="subsection"> 112 <h3><a name="vc">Visual C++ 2008</a></h3> 113 <p> 114 There are project files for Visual C++ 2008 in <kbd>vs2008/</kbd>. Users can build a static library <kbd>libmarisa.lib</kbd> and the command line tools by using <kbd>vs2008/vs2008.sln</kbd>. If your Visual C++ is older than 2008. New projects are required to build libmarisa. 115 </p> 116 </div><!-- subsection --> 117 <div class="subsection"> 118 <h3><a name="vc">Perl Bindings</a></h3> 119 <div class="float"> 120 <pre class="console">$ cd bindings/perl 121$ perl Makefile.PL 122$ make 123$ make install</pre> 124 </div><!-- float --> 125 <p> 126 Users can find a Perl bindings in <kbd>bindings/perl/</kbd>, in which the wrapper was generated by <a href="http://www.swig.org/">SWIG</a>. To install the Perl bindings, run <kbd>perl Makefile.PL</kbd> and then <kbd>make install</kbd>. See also <kbd>bindings/perl/sample.pl</kbd>. 127 </p> 128 </div><!-- subsection --> 129 <div class="subsection"> 130 <h3><a name="vc">Python Bindings</a></h3> 131 <div class="float"> 132 <pre class="console">$ cd bindings/python 133$ python setup.py build 134$ python setup.py install</pre> 135 </div><!-- float --> 136 <p> 137 Users can find a Python bindings in <kbd>bindings/python/</kbd>, in which the wrapper was generated by <a href="http://www.swig.org/">SWIG</a>. To install the Python bindings, run <kbd>python setup.py install</kbd>. See also <kbd>bindings/python/sample.py</kbd>. 138 </p> 139 </div><!-- subsection --> 140 <div class="subsection"> 141 <h3><a name="vc">Ruby Bindings</a></h3> 142 <div class="float"> 143 <pre class="console">$ cd bindings/ruby 144$ ruby extconf.rb 145$ make 146$ make install</pre> 147 </div><!-- float --> 148 <p> 149 Users can find a Ruby bindings in <kbd>bindings/ruby/</kbd>, in which the wrapper was generated by <a href="http://www.swig.org/">SWIG</a>. To install the Ruby bindings, run <kbd>ruby extconf.rb</kbd> and then <kbd>make install</kbd>. See also <kbd>bindings/ruby/sample.rb</kbd>. 150 </p> 151 </div><!-- subsection --> 152 <div class="subsection"> 153 <h3><a name="vc">Others</a></h3> 154 <p> 155 There are some other bindings. 156 </p> 157 <ul> 158 <li>Python 159 <ul> 160 <li><a href="https://github.com/kmike/marisa-trie/">https://github.com/kmike/marisa-trie/</a> an alternative Cython-based pip-installable Python bindings which is faster than the above Python bindings.</li> 161 </ul> 162 </li> 163 <li>Node.js</li> 164 <ul> 165 <li><a href="https://github.com/jakwings/iojs-marisa-trie">https://github.com/jakwings/iojs-marisa-trie</a> a wrapper of marisa-trie</li> 166 </ul> 167 </li> 168 </ul> 169 </div><!-- subsection --> 170 </div><!-- section --> 171 172 <div class="section"> 173 <h2><a name="tools">Command Line Tools</a></h2> 174 <div class="subsection"> 175 <h3><a name="marisa-build">marisa-build</a></h3> 176 <div class="float"> 177 <pre class="console">$ marisa-build < keyset.txt > keyset.dic 178#keys: 9864459 179#nodes: 13473881 180size: 51044424</pre> 181 </div><!-- float --> 182 <p> 183 <kbd>marisa-build</kbd> is a tool to build a dictionary from a set of keys. This tool takes as input newline-delimited keys and writes the dictionary to the standard output. 184 </p> 185 <p> 186 Users can specify parameters through command line options. See <kbd>marisa-build -h</kbd> for the list of options. 187 </p> 188 <p> 189 If an input line contains horizontal tabs, the last one serves as the delimiter between a key and its weight which is used to optimize the order of nodes. Estimated frequency of each key, given as the weight, may improve the search performance. 190 </p> 191 </div><!-- subsection --> 192 <div class="subsection"> 193 <h3><a name="marisa-lookup">marisa-lookup</a></h3> 194 <div class="float"> 195 <pre class="console">$ marisa-lookup keyset.dic 196Marisa 197915465 Marisa 198What's_uuup 199-1 What's_uuup</pre> 200 </div><!-- float --> 201 <p> 202 <kbd>marisa-lookup</kbd> is a tool to test exact match lookup. If a query string is registered, this tool prints the key and its ID. Otherwise, this tool prints the query with <var>-1</var>. 203 </p> 204 <p> 205 See <kbd>marisa-lookup -h</kbd> for the list of options. 206 </p> 207 </div><!-- subsection --> 208 <div class="subsection"> 209 <h3><a name="marisa-reverse-lookup">marisa-reverse-lookup</a></h3> 210 <div class="float"> 211 <pre class="console">$ marisa-reverse-lookup keyset.dic 2121234567 2131234567 Goma_International_Airport</pre> 214 </div><!-- float --> 215 <p> 216 <kbd>marisa-reverse-lookup</kbd> is a tool to test reverse lookup. If a given ID is not out-of-range, this tool restores the associated key and prints it. The available ID is <var>0</var> to <var>(n - 1)</var>, where <var>n</var> is the number of keys. Note that an out-of-range ID causes an error. 217 </p> 218 <p> 219 See <kbd>marisa-reverse-lookup -h</kbd> for the list of options. 220 </p> 221 </div><!-- subsection --> 222 <div class="subsection"> 223 <h3><a name="marisa-common-prefix-search">marisa-common-prefix-search</a></h3> 224 <div class="float"> 225 <pre class="console">$ marisa-common-prefix-search keyset.dic 226USA 2273 found 22820 U USA 2291526 US USA 23037471 USA USA</pre> 231 </div><!-- float --> 232 <p> 233 <kbd>marisa-common-prefix-search</kbd> is a tool to test common prefix search. This tool searches keys from the possible prefixes of a query string and then prints the first <var>m</var> keys, where <var>m</var> is one of the parameters. 234 </p> 235 <p> 236 See <kbd>marisa-common-prefix-search -h</kbd> for the list of options. 237 </p> 238 </div><!-- subsection --> 239 <div class="subsection"> 240 <h3><a name="marisa-predictive-search">marisa-predictive-search</a></h3> 241 <div class="float"> 242 <pre class="console">$ marisa-predictive-search keyset.dic -n 2 243Touhou 24415 found 245975378 Touhou Touhou 2465508004 Touhou_Hisotensoku Touhou</pre> 247 </div><!-- float --> 248 <p> 249 <kbd>marisa-predictive-search</kbd> is a tool to test predictive search. This tool searches keys starting with a query string and then prints the first <var>m</var> keys, where <var>m</var> is one of the parameters. 250 </p> 251 <p> 252 See <kbd>marisa-predictive-search -h</kbd> for the list of options. 253 </p> 254 </div><!-- subsection --> 255 <div class="subsection"> 256 <h3><a name="marisa-benchmark">marisa-benchmark</a></h3> 257 <div class="float"> 258 <pre class="console">$ marisa-benchmark keyset.txt 259Number of tries: 1 - 5 260TAIL mode: Text mode 261Node order: Descending weight order 262Cache level: Normal cache 263Number of keys: 9864459 264Total length: 191858227 265------+----------+--------+--------+--------+--------+-------- 266#tries size build lookup reverse prefix predict 267 lookup search search 268 [bytes] [K/s] [K/s] [K/s] [K/s] [K/s] 269------+----------+--------+--------+--------+--------+-------- 270 1 69905816 334.84 1368.16 1304.82 1080.44 605.92 271 2 53635744 284.03 762.91 773.68 662.04 244.35 272 3 51044424 278.89 688.86 703.60 604.44 212.00 273 4 50309000 277.01 669.23 680.78 588.57 204.23 274 5 50042232 275.93 636.83 674.26 562.08 199.48 275------+----------+--------+--------+--------+--------+--------</pre> 276 </div><!-- float --> 277 <p> 278 <kbd>marisa-benchmark</kbd> is a tool to benchmark libmarisa. This tool takes the same input as <kbd>marisa-build</kbd> and measures the performance of libmarisa for the given set of keys. This tool is useful to fix dictionary settings. 279 </p> 280 <p> 281 For the search performance, <kbd>marisa-benchmark</kbd> measures the time to lookup or search keys in input order. When the keys are given in lexicographic order, few cache misses will occur in the benchmark. In contrast, when the keys are given in random order, many cache misses will occur in the benchmark. 282 </p> 283 <p> 284 See <kbd>marisa-benchmark -h</kbd> for the list of options. 285 </p> 286 </div><!-- subsection --> 287 <div class="subsection"> 288 <h3><a name="marisa-dump">marisa-dump</a></h3> 289 <div class="float"> 290 <pre class="console">$ marisa-dump keyset.dic | head -3 291input: keyset.dic 292S 293St 294Sta</pre> 295 </div><!-- float --> 296 <p> 297 <kbd>marisa-build</kbd> is a tool to dump a dictionary. This tool prints all the keys in a given dictionary. 298 </p> 299 <p> 300 Users can specify the delimiter through command line options. See <kbd>marisa-dump -h</kbd> for the list of options. 301 </p> 302 </div><!-- subsection --> 303 </div><!-- section --> 304 305 <div class="section"> 306 <h2><a name="library">Library</a></h2> 307 <div class="subsection"> 308 <h3><a name="howto">How to Use</a></h3> 309 <div class="float"> 310 <pre class="code">// sample.cc 311#include <iostream> 312#include <marisa.h> 313 314int main() { 315 marisa::Keyset keyset; 316 keyset.push_back("a"); 317 keyset.push_back("app"); 318 keyset.push_back("apple"); 319 320 marisa::Trie trie; 321 trie.build(keyset); 322 323 marisa::Agent agent; 324 agent.set_query("apple"); 325 while (trie.common_prefix_search(agent)) { 326 std::cout.write(agent.key().ptr(), agent.key().length()); 327 std::cout << ": " << agent.key().id() << std::endl; 328 } 329 return 0; 330}</pre> 331 </div><!-- float --> 332 <div class="float"> 333 <pre class="console">$ g++ sample.cc -lmarisa 334$ ./a.out 335a: 0 336app: 1 337apple: 2</pre> 338 </div><!-- float --> 339 <p> 340 libmarisa provides <kbd>marisa.h</kbd> in which all the headers are <code>#include</code>d. Also, libmarisa uses <code>namespace marisa</code>. All the classes and functions except enumeration types are given as members of this namespace. Note that <code>using namespace marisa</code> may cause a critical error. Finally, <kbd>gcc</kbd> and <kbd>clang</kbd> require an option, <kbd>-lmarisa</kbd>, to link libmarisa with an application. 341 </p> 342 <p> 343 The core components of libmarisa are <a href="#keyset">Keyset</a>, <a href="#agent">Agent</a>, and <a href="#trie">Trie</a>. In addition, libmarisa provides an exception class, <a href="#exception">Exception</a>, and two more classes, <a href="#key">Key</a> and <a href="#query">Query</a>, as members of <code>Keyset</code> and <code>Agent</code>. 344 </p> 345 <ul> 346 <li><code>Keyset</code>: A class to store a set of keys. This class is used to build a set of keys for building a dictionary. Also, this class is useful to store search results.</li> 347 <li><code>Agent</code>: A class to store a query and a result of search operations. Every search function takes a reference to this class.</li> 348 <li><code>Trie</code>: A dictionary class.</li> 349 </ul> 350 <p> 351 For more examples, you can find the source code of the command line tools in <kbd>tools/</kbd>. The source code is useful as an example of error handling, predicive search, etc. 352 </p> 353 </div><!-- subsection --> 354 355 <div class="subsection"> 356 <h3><a name="enum">Enumeration Constants</a></h3> 357 <div class="subsubsection"> 358 <h4>Error Codes</h4> 359 <div class="float"> 360 <pre class="code">typedef enum marisa_error_code_ { 361 MARISA_OK = 0, 362 MARISA_STATE_ERROR = 1, 363 MARISA_NULL_ERROR = 2, 364 MARISA_BOUND_ERROR = 3, 365 MARISA_RANGE_ERROR = 4, 366 MARISA_CODE_ERROR = 5, 367 MARISA_RESET_ERROR = 6, 368 MARISA_SIZE_ERROR = 7, 369 MARISA_MEMORY_ERROR = 8, 370 MARISA_IO_ERROR = 9, 371 MARISA_FORMAT_ERROR = 10, 372} marisa_error_code;</pre> 373 </div><!-- float --> 374 <p> 375 libmarisa throws an instance of <code>Exception</code> when an error occurs, such as a file I/O error (<var>MARISA_IO_ERROR</var>), a size limitation error (<var>MARISA_SIZE_ERROR</var>), etc. For details, see <kbd>marisa/base.h</kbd>. 376 </p> 377 </div><!-- subsubsection --> 378 <div class="subsubsection"> 379 <h4>Number of Tries</h4> 380 <div class="float"> 381 <pre class="code"> 382typedef enum marisa_num_tries_ { 383 MARISA_MIN_NUM_TRIES = 0x00001, 384 MARISA_MAX_NUM_TRIES = 0x0007F, 385 MARISA_DEFAULT_NUM_TRIES = 0x00003, 386} marisa_num_tries;</pre> 387 </div><!-- float --> 388 <p> 389 MARISA is a recursive data structure in which a patricia trie is used to represent another patricia trie. A deeper recursion makes a dictionary more compact but degrades the search performance. For this time-space tradeoff, libmarisa provides a parameter to limit the recursion depth, which is equivalent to the number of tries. <code>marisa_num_tries</code> gives the range and the default setting of this parameter. 390 </p> 391 <p> 392 The best setting depends on the set of keys and the applications. In most cases, libmarisa works well with the default setting, <var>MARISA_DEFAULT_NUM_TRIES</var>, but if the application requires better search performance, <var>MARISA_MIN_NUM_TRIES</var> may be a better choice. Also, if the application uses long and complicated keys, a deeper recursion may achieve much higher spece-efficiency. <kbd>marisa-benchmark</kbd> is useful to find the best setting. 393 </p> 394 </div><!-- subsubsection --> 395 <div class="subsubsection"> 396 <h4>Cache Size</h4> 397 <div class="float"> 398 <pre class="code">typedef enum marisa_cache_level_ { 399 MARISA_HUGE_CACHE = 0x00080, 400 MARISA_LARGE_CACHE = 0x00100, 401 MARISA_NORMAL_CACHE = 0x00200, 402 MARISA_SMALL_CACHE = 0x00400, 403 MARISA_TINY_CACHE = 0x00800, 404 MARISA_DEFAULT_CACHE = MARISA_NORMAL_CACHE 405} marisa_cache_level;</pre> 406 </div><!-- float --> 407 <p> 408 libmarisa embeds a precomputed table to a dictionary. The table serves as transition cache which improves the search performance but increases the dictionary size. Cache size is the parameter of this time-space tradeoff. 409 </p> 410 <p> 411 <code>marisa_cache_level</code> gives a list of available cache size. Compared with <var>MARISA_NORMAL_CACHE</var>, <var>MARISA_LARGE_CACHE</var> is 2 times larger, <var>MARISA_HUGE_CACHE</var> is 4 times larger, <var>MARISA_SMALL_CACHE</var> is 2 times smaller, and <var>MARISA_TINY_CACHE</var> is 4 times smaller. 412 </p> 413 </div><!-- subsubsection --> 414 <div class="subsubsection"> 415 <h4>TAIL Mode</h4> 416 <div class="float"> 417 <pre class="code">typedef enum marisa_tail_mode_ { 418 MARISA_TEXT_TAIL = 0x01000, 419 MARISA_BINARY_TAIL = 0x02000, 420 MARISA_DEFAULT_TAIL = MARISA_TEXT_TAIL, 421} marisa_tail_mode;</pre> 422 </div><!-- float --> 423 <p> 424 The last patricia trie of MARISA stores its multi-byte labels as strings and <code>marisa_tail_mode</code> gives a list of TAIL implementations. 425 </p> 426 <p> 427 <var>MARISA_TEXT_TAIL</var> stores labels as zero-terminated strings. If the labels contain <var>'\0'</var>, the TAIL mode is automatically switched to <var>MARISA_BINARY_TAIL</var>. 428 </p> 429 <p> 430 On the other hand, <var>MARISA_BINARY_TAIL</var> uses a bit vector, instead of <var>'\0'</var>, to detect the end of labels. This means that <var>MARISA_TEXT_TAIL</var> is more space-efficient than <var>MARISA_BINARY_TAIL</var> when the average length of multi-byte labels is longer than <var>8 bytes</var>. 431 </p> 432 </div><!-- subsubsection --> 433 <div class="subsubsection"> 434 <h4>Node Order</h4> 435 <div class="float"> 436 <pre class="code">typedef enum marisa_node_order_ { 437 MARISA_LABEL_ORDER = 0x10000, 438 MARISA_WEIGHT_ORDER = 0x20000, 439 MARISA_DEFAULT_ORDER = MARISA_WEIGHT_ORDER, 440} marisa_node_order;</pre> 441 </div><!-- float --> 442 <p> 443 A dictionary has one more parameter, which is the order of nodes. There are two choices, <var>MARISA_LABEL_ORDER</var> and <var>MARISA_WEIGHT_ORDER</var>. The former arranges nodes in ascending order of the label and the latter arranges nodes in descending order of the weight. Many trie implementations arrange nodes in the label order but libmarisa uses <var>MARISA_WEIGHT_ORDER</var> as the default setting. 444 </p> 445 <p> 446 <var>MARISA_WEIGHT_ORDER</var> optimizes the node order for linear search performed in exact match lookup, common prefix search, and predictive search. In practice, experiments for English words/phrases showed that <var>MARISA_WEIGHT_ORDER</var> halved the average search time. On the other hand, <var>MARISA_LABEL_ORDER</var> enables predictive search to restore keys in lexicographic order. 447 </p> 448 </div><!-- subsubsection --> 449 <div class="subsubsection"> 450 <h4>Aliases</h4> 451 <div class="float"> 452 <pre class="code">namespace marisa { 453 typedef ::marisa_error_code ErrorCode; 454 typedef ::marisa_cache_level CacheLevel; 455 typedef ::marisa_tail_mode TailMode; 456 typedef ::marisa_node_order NodeOrder; 457} // namespace marisa</pre> 458 </div><!-- float --> 459 <p> 460 The above enumeration types are defined in the global namespace to avoid collisions of the enumeration constants with macros provided by other modules. libmarisa provides type aliases and users can choose the familiar one. 461 </p> 462 </div><!-- subsubsection --> 463 </div><!-- subsection --> 464 465 <div class="subsection"> 466 <h3><a name="exception">class Exception</a></h3> 467 <div class="float"> 468 <pre class="code">class Exception { 469 public: 470 const char *filename() const; 471 int line() const; 472 ErrorCode error_code() const; 473 const char *error_message() const; 474 475 const char *what() const; 476};</pre> 477 </div><!-- float --> 478 <p> 479 <code>Exception</code> is an exception class. libmarisa throws an instance of <code>Exception</code> with the file name (<code>__FILE__</code>), the line number (<code>__LINE__</code>), and an error code (<code>ErrorCode</code>) when an error is detected. The instance also has an error message formatted <var>__FILE__:__LINE__: error_code: error_message</var>. 480 </p> 481 </div><!-- subsection --> 482 483 <div class="subsection"> 484 <h3><a name="key">class Key</a></h3> 485 <div class="float"> 486 <pre class="code">class Key { 487 public: 488 char operator[](std::size_t i) const; 489 const char *ptr() const; 490 std::size_t length() const; 491 std::size_t id() const; 492};</pre> 493 </div><!-- float --> 494 <p> 495 <code>Key</code> is a member of <a href="#keyset">Keyset</a> and <a href="#agent">Agent</a>. Each key of <code>Keyset</code> is represented by this class. Also, the search result of <code>Agent</code> is represented by this class. 496 </p> 497 </div><!-- subsection --> 498 499 <div class="subsection"> 500 <h3><a name="query">class Query</a></h3> 501 <div class="float"> 502 <pre class="code">class Query { 503 public: 504 char operator[](std::size_t i) const; 505 const char *ptr() const; 506 std::size_t length() const; 507 std::size_t id() const; 508};</pre> 509 </div><!-- float --> 510 <p> 511 <code>Query</code> is a member of <a href="#agent">Agent</a>. This class stores a query string and an ID as input for search functions. Users cannot make changes directly to <code>Query</code> because <code>Agent</code> provides a special interface to update its query. 512 </p> 513 </div><!-- subsection --> 514 515 <div class="subsection"> 516 <h3><a name="keyset">class Keyset</a></h3> 517 <div class="float"> 518 <pre class="code">class Keyset { 519 public: 520 Keyset(); 521 522 void push_back(const Key &key); 523 void push_back(const Key &key, char end_marker); 524 525 void push_back(const char *str); 526 void push_back(const char *ptr, 527 std::size_t length, 528 float weight = 1.0); 529 530 const Key &operator[](std::size_t i) const; 531 Key &operator[](std::size_t i); 532 533 std::size_t num_keys(); 534 535 bool empty() const; 536 std::size_t size() const; 537 std::size_t total_length() const; 538 539 void reset(); 540 541 void clear(); 542 void swap(Keyset &rhs); 543};</pre> 544 </div><!-- float --> 545 <div class="subsubsection"> 546 <h4>Overview</h4> 547 <p> 548 <code>Keyset</code> is used to store a set of keys for dictionary construction or to save the results of search functions. 549 </p> 550 </div><!-- subsubsection --> 551 <div class="subsubsection"> 552 <h4>Dictionary Source</h4> 553 <p> 554 For dictionary construction, users append keys to <code>Keyset</code> by using <code>push_back()</code> and then pass the keyset to <code>build()</code> of <a href="#trie">Trie</a>. <var>weight</var> is an argument to receive the frequency or possibility of each key. If there are same keys, the weights are accumulated in dictionary construction. 555 </p> 556 <p> 557 After dictionary construction, users can read the associated IDs through <code>operator[]()</code>. Instead, the weights are overwritten by the IDs because <code>Key</code> uses a <code>union</code> to store a weight or an ID. 558 </p> 559 </div><!-- subsubsection --> 560 <div class="subsubsection"> 561 <h4>Search Result</h4> 562 <p> 563 Users can save a search result to <code>Keyset</code> by using <code>push_back()</code>. When <code>key()</code> of <a href="#agent">Agent</a> is given, a copy of the search result is stored in <code>Keyset</code>. If you want to append an end marker, such as <code>'\0'</code>, use <var>end_marker</var> of <code>push_back()</code>. 564 </p> 565 <p> 566 If you want to reuse an instance of <code>Keyset</code>, <code>reset()</code> may be a better choice than <code>clear()</code> because <code>reset()</code> keeps allocated memory in order to reduce memory allocation overhead. 567 </p> 568 <p> 569 <code>num_keys()</code> and <code>size()</code> return the number of keys. <code>empty()</code> checks whether the number of keys is <var>0</var> or not. <code>total_length()</code> returns the total length in byte. 570 </p> 571 </div><!-- subsubsection --> 572 </div><!-- subsection --> 573 574 <div class="subsection"> 575 <h3><a name="agent">class Agent</a></h3> 576 <div class="float"> 577 <pre class="code">class Agent { 578 public: 579 Agent(); 580 581 const Query &query() const; 582 const Key &key() const; 583 584 void set_query(const char *str); 585 void set_query(const char *ptr, 586 std::size_t length); 587 void set_query(std::size_t key_id); 588};</pre> 589 </div><!-- float --> 590 <p> 591 <code>Agent</code> is actually a tuple of <code>Query</code>, <code>Key</code>, and <code>State</code>. This class is used as I/O of search functions. Also, <code>State</code> is an incomplete type to keep the internal state of search operation. 592 </p> 593 <p> 594 A lookup operation requires 3 steps as follows: 1. sets a query string by <code>set_query()</code> of <code>Agent</code>, 2. passes the agent to <code>lookup()</code> of <code>Trie</code>, and 3. gets the search result by <code>key()</code> of <code>Agent</code>. The other operations proceed in the same way. 595 </p> 596 </div><!-- subsection --> 597 598 <div class="subsection"> 599 <h3><a name="trie">class Trie</a></h3> 600 <div class="float"> 601 <pre class="code">class Trie { 602 public: 603 Trie(); 604 605 void build(Keyset &keyset, 606 int config_flags = 0); 607 608 void mmap(const char *filename); 609 void map(const void *ptr, 610 std::size_t size); 611 612 void load(const char *filename); 613 void read(int fd); 614 615 void save(const char *filename) const; 616 void write(int fd) const; 617 618 bool lookup(Agent &agent) const; 619 void reverse_lookup(Agent &agent) const; 620 bool common_prefix_search(Agent &agent) const; 621 bool predictive_search(Agent &agent) const; 622 623 std::size_t num_tries() const; 624 std::size_t num_keys() const; 625 std::size_t num_nodes() const; 626 627 TailMode tail_mode() const; 628 NodeOrder node_order() const; 629 630 bool empty() const; 631 std::size_t size() const; 632 std::size_t io_size() const; 633 634 void clear(); 635 void swap(Trie &rhs); 636};</pre> 637 </div><!-- float --> 638 <div class="subsubsection"> 639 <h4>Overview</h4> 640 <p> 641 <code>Trie</code> is a dictionary class, which is the most important component of libmarisa. All the operations are performed through this class. 642 </p> 643 <p> 644 In fact, <code>Trie</code> is a dictionary handle, and if the handle is invalid, functions other than <code>build()</code>, <code>mmap()</code>, <code>map()</code>, <code>load()</code>, <code>read()</code>, <code>clear()</code>, <code>swap()</code> throw an exception. 645 </p> 646 </div><!-- subsubsection --> 647 <div class="subsubsection"> 648 <h4>Construction</h4> 649 <p> 650 You can build a dictionary by using <code>build()</code>. The arguments are the above mentioned <a href="#keyset">Keyset</a> and a dictionary setting, <var>config_flags</var>, which is represented by a combination of flags. For example, <var>2 | MARISA_BINARY_TAIL</var> specifies the maximum number of tries (<var>2</var>) and a TAIL mode (<var>MARISA_BINARY_TAIL</var>). Also, in this case, the default settings, <var>MARISA_DEFAULT_ORDER</var> and <var>MARISA_DEFAULT_CACHE</var>, are used for the node order and the cache size. 651 </p> 652 <p> 653 The IDs associated with the keys are available through <code>operator[]()</code> of <var>keyset</var>, and the IDs are useful to associate the keys with any data types. 654 </p> 655 </div><!-- subsubsection --> 656 <div class="subsubsection"> 657 <h4>File I/O</h4> 658 <p> 659 <code>mmap()</code> is an interface for memory mapped I/O. If an application performs a few search operations, it is unnecessary to read the whole dictionary, and in such a case, <code>mmap()</code> is useful. Also, memory mapped I/O is an easy way to share dictionary data among processes. On the other hand, if an application performs a lot of search operations, a memory mapped dictionary might cause a lot of random disk accesses which considerably increase the search time. 660 </p> 661 <p> 662 <code>map()</code> restores an instance of <code>Trie</code> from dictionary data on memory. <code>load()</code> and <code>read()</code> read a dictionary from a file or a file descriptor. <code>save()</code> and <code>write()</code> write a dictionary to a file or a file descriptor. 663 </p> 664 </div><!-- subsubsection --> 665 <div class="subsubsection"> 666 <h4>Search</h4> 667 <p> 668 <code>Trie</code> provides 4 search functions <code>lookup()</code>, <code>reverse_lookup()</code>, <code>common_prefix_search()</code>, and <code>predictive_search()</code> as follows: 669 </p> 670 <ul> 671 <li> 672 <code>lookup()</code> checks whether a query string is registered or not, and if it is registered, <code>lookup()</code> returns <var>true</var>. In this case, the search result is available through <code>agent.key()</code>. Note that <code>lookup()</code> does not restore a key and <code>agent.key().ptr()</code> points to the query string because the two strings are the same. 673 </li> 674 <li> 675 <code>reverse_lookup()</code> restores a key from its ID. This function has no return value and the key is available through <var>agent.key()</var>. The key is actually stored in <var>agent</var> and it is lost when <var>agent</var> is reset or used for another search operation. If a given ID is out-of-range, <code>reverse_lookup()</code> throws an exception. 676 </li> 677 <li> 678 <code>common_prefix_search()</code> searches keys from the possible prefixes of a query string. If there are matching keys, this function returns <var>true</var>. In this case, the first key is available through <code>agent.key()</code>, and if there are more than one matching keys, the next key will be available after the next <code>common_prefix_search()</code> which returns <var>true</var> until there are no more matching keys. Note that <code>agent.key().ptr() == agent.query().ptr()</code> is always <var>true</var> when <code>common_prefix_search()</code> has returned <var>true</var>. 679 </li> 680 <li> 681 <code>predictive_search()</code> searches keys starting with a query string, and similar to <code>common_prefix_search()</code>, this function returns <var>true</var> until there are no more matching keys. 682 </li> 683 </ul> 684 <p> 685 Note that <code>agent</code> keeps the internal state of <code>common_prefix_search()</code> and <code>predictive_search()</code> until <code>agent</code> is passed to another search function or <code>agent.set_query()</code> is called. 686 </p> 687 <p> 688 <code>num_keys()</code> and <code>size()</code> return the number of keys. <code>empty()</code> checks whether the number of keys is <var>0</var> or not. <code>io_size()</code> returns the dictionary size in byte. 689 </p> 690 </div><!-- subsubsection --> 691 </div><!-- subsection --> 692 693 <div class="subsection"> 694 <h3><a name="stdio">stdio</a></h3> 695 <div class="float"> 696 <pre class="code">void fread(std::FILE *file, Trie *trie); 697void fwrite(std::FILE *file, const Trie &trie);</pre> 698 </div><!-- float --> 699 <p> 700 The functions for I/O using <code>std::FILE</code> are declared in <kbd>marisa/stdio.h</kbd>. If you don't want to <code>#include <cstdio></code>, use <kbd>marisa/trie.h</kbd> instead of <kbd>marisa.h</kbd>. 701 </p> 702 </div><!-- subsection --> 703 704 <div class="subsection"> 705 <h3><a name="iostream">iostream</a></h3> 706 <div class="float"> 707 <pre class="code">std::istream &read(std::istream &stream, Trie *trie); 708std::ostream &write(std::ostream &stream, const Trie &trie); 709 710std::istream &operator>>(std::istream &stream, Trie &trie); 711std::ostream &operator<<(std::ostream &stream, const Trie &trie);</pre> 712 </div><!-- float --> 713 <p> 714 The functions for I/O using <code>std::iostream</code> are declared in <kbd>marisa/iostream.h</kbd>. If you don't want to <code>#include <iosfwd></code>, use <kbd>marisa/trie.h</kbd> instead of <kbd>marisa.h</kbd>. 715 </p> 716 </div><!-- subsection --> 717 </div><!-- section --> 718 719 <div class="section"> 720 <h2><a name="compatibility">Cross-architecture compatibility</a></h2> 721 <p> 722 The dictionary format of libmarisa depends on the architecture. Dictionaries built on a little endian architecture don't work on a big endian architecture. Also, on a big endian architecture, dictionaries built on a 32-bit machine don't work on a 64-bit machine and vise versa. On a little endian architecture, dictionaries are compatible on 32/64-bit machines. 723 </p> 724 </div><!-- section --> 725 726 <div class="section"> 727 <h2><a name="references">References</a></h2> 728 <ul> 729 <li><a href="https://www.slideshare.net/s5yata/x86opti-05-s5yata">Remove Branches in BitVector Select Operations - marisa 0.2.2 -</a> 730 <ul> 731 <li>This PowerPoint presentation describes improvements in marisa 0.2.2.</li> 732 </ul> 733 </li> 734 </ul> 735 </div><!-- section --> 736 737 <div class="section"> 738 <h2><a name="conclusion">Last Spell</a></h2> 739 <p> 740 Feel free to contact me for any questions. 741 </p> 742 </div><!-- section --> 743 </div><!-- body --> 744 <div id="footer"> 745 <div class="left">MARISA: Matching Algorithm with Recursively Implemented StorAge</div> 746 <div class="right"> 747 moc.liamg@atay.umusus 748 </div> 749 <div class="end"></div> 750 </div><!-- footer --> 751 </body> 752</html> 753