• Home
Name Date Size #Lines LOC

..--

README.tokenizersD03-May-20245.1 KiB134105

README.txtD03-May-2024219 54

fts2.cD03-May-2024227.5 KiB7,2904,694

fts2.hD03-May-2024705 278

fts2_hash.cD03-May-202411 KiB375254

fts2_hash.hD03-May-20243.8 KiB11137

fts2_icu.cD03-May-20246.6 KiB261184

fts2_porter.cD03-May-202416.9 KiB642411

fts2_tokenizer.cD03-May-202410.4 KiB375220

fts2_tokenizer.hD03-May-20245.7 KiB14635

fts2_tokenizer1.cD03-May-20246.4 KiB231139

mkfts2amal.tclD03-May-20243.3 KiB11770

README.tokenizers

1
21. FTS2 Tokenizers
3
4  When creating a new full-text table, FTS2 allows the user to select
5  the text tokenizer implementation to be used when indexing text
6  by specifying a "tokenizer" clause as part of the CREATE VIRTUAL TABLE
7  statement:
8
9    CREATE VIRTUAL TABLE <table-name> USING fts2(
10      <columns ...> [, tokenizer <tokenizer-name> [<tokenizer-args>]]
11    );
12
13  The built-in tokenizers (valid values to pass as <tokenizer name>) are
14  "simple" and "porter".
15
16  <tokenizer-args> should consist of zero or more white-space separated
17  arguments to pass to the selected tokenizer implementation. The
18  interpretation of the arguments, if any, depends on the individual
19  tokenizer.
20
212. Custom Tokenizers
22
23  FTS2 allows users to provide custom tokenizer implementations. The
24  interface used to create a new tokenizer is defined and described in
25  the fts2_tokenizer.h source file.
26
27  Registering a new FTS2 tokenizer is similar to registering a new
28  virtual table module with SQLite. The user passes a pointer to a
29  structure containing pointers to various callback functions that
30  make up the implementation of the new tokenizer type. For tokenizers,
31  the structure (defined in fts2_tokenizer.h) is called
32  "sqlite3_tokenizer_module".
33
34  FTS2 does not expose a C-function that users call to register new
35  tokenizer types with a database handle. Instead, the pointer must
36  be encoded as an SQL blob value and passed to FTS2 through the SQL
37  engine by evaluating a special scalar function, "fts2_tokenizer()".
38  The fts2_tokenizer() function may be called with one or two arguments,
39  as follows:
40
41    SELECT fts2_tokenizer(<tokenizer-name>);
42    SELECT fts2_tokenizer(<tokenizer-name>, <sqlite3_tokenizer_module ptr>);
43
44  Where <tokenizer-name> is a string identifying the tokenizer and
45  <sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module
46  structure encoded as an SQL blob. If the second argument is present,
47  it is registered as tokenizer <tokenizer-name> and a copy of it
48  returned. If only one argument is passed, a pointer to the tokenizer
49  implementation currently registered as <tokenizer-name> is returned,
50  encoded as a blob. Or, if no such tokenizer exists, an SQL exception
51  (error) is raised.
52
53  SECURITY: If the fts2 extension is used in an environment where potentially
54    malicious users may execute arbitrary SQL (i.e. gears), they should be
55    prevented from invoking the fts2_tokenizer() function, possibly using the
56    authorisation callback.
57
58  See "Sample code" below for an example of calling the fts2_tokenizer()
59  function from C code.
60
613. ICU Library Tokenizers
62
63  If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor
64  symbol defined, then there exists a built-in tokenizer named "icu"
65  implemented using the ICU library. The first argument passed to the
66  xCreate() method (see fts2_tokenizer.h) of this tokenizer may be
67  an ICU locale identifier. For example "tr_TR" for Turkish as used
68  in Turkey, or "en_AU" for English as used in Australia. For example:
69
70    "CREATE VIRTUAL TABLE thai_text USING fts2(text, tokenizer icu th_TH)"
71
72  The ICU tokenizer implementation is very simple. It splits the input
73  text according to the ICU rules for finding word boundaries and discards
74  any tokens that consist entirely of white-space. This may be suitable
75  for some applications in some locales, but not all. If more complex
76  processing is required, for example to implement stemming or
77  discard punctuation, this can be done by creating a tokenizer
78  implementation that uses the ICU tokenizer as part of its implementation.
79
80  When using the ICU tokenizer this way, it is safe to overwrite the
81  contents of the strings returned by the xNext() method (see
82  fts2_tokenizer.h).
83
844. Sample code.
85
86  The following two code samples illustrate the way C code should invoke
87  the fts2_tokenizer() scalar function:
88
89      int registerTokenizer(
90        sqlite3 *db,
91        char *zName,
92        const sqlite3_tokenizer_module *p
93      ){
94        int rc;
95        sqlite3_stmt *pStmt;
96        const char zSql[] = "SELECT fts2_tokenizer(?, ?)";
97
98        rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
99        if( rc!=SQLITE_OK ){
100          return rc;
101        }
102
103        sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
104        sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC);
105        sqlite3_step(pStmt);
106
107        return sqlite3_finalize(pStmt);
108      }
109
110      int queryTokenizer(
111        sqlite3 *db,
112        char *zName,
113        const sqlite3_tokenizer_module **pp
114      ){
115        int rc;
116        sqlite3_stmt *pStmt;
117        const char zSql[] = "SELECT fts2_tokenizer(?)";
118
119        *pp = 0;
120        rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
121        if( rc!=SQLITE_OK ){
122          return rc;
123        }
124
125        sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
126        if( SQLITE_ROW==sqlite3_step(pStmt) ){
127          if( sqlite3_column_type(pStmt, 0)==SQLITE_BLOB ){
128            memcpy(pp, sqlite3_column_blob(pStmt, 0), sizeof(*pp));
129          }
130        }
131
132        return sqlite3_finalize(pStmt);
133      }
134

README.txt

1This folder contains source code to the second full-text search
2extension for SQLite.  While the API is the same, this version uses a
3substantially different storage schema from fts1, so tables will need
4to be rebuilt.
5