FTS3 is an SQLite virtual table module that allows users to perform full-text searches on a set of documents. The most common (and effective) way to describe full-text searches is "what Google, Yahoo and Altavista do with documents placed on the World Wide Web". Users input a term, or series of terms, perhaps connected by a binary operator or grouped together into a phrase, and the full-text query system finds the set of documents that best matches those terms considering the operators and groupings the user has specified. This document describes the deployment and usage of FTS3.
Portions of the original FTS3 code were contributed to the SQLite project by Scott Hess of Google. It is now developed and maintained as part of SQLite.
The FTS3 extension module allows users to create special tables with a built-in full-text index (hereafter "FTS3 tables"). The full-text index allows the user to efficiently query the database for all rows that contain one or more instances of a specified word (hereafter a "token"), even if the table contains many large documents.
For example, if each of the 517430 documents in the "Enron E-Mail Dataset" is inserted into both an FTS3 table and an ordinary SQLite table created using the following SQL script:
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);     /* FTS3 table */
CREATE TABLE enrondata2(content TEXT);                        /* Ordinary table */
Then either of the two queries below may be executed to find the number of documents in the database that contain the word "linux" (351). Using one desktop PC hardware configuration, the query on the FTS3 table returns in approximately 0.03 seconds, versus 22.5 seconds for the query on the ordinary table.
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';  /* 0.03 seconds */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
Of course, the two queries above are not entirely equivalent. For example, the LIKE query matches rows that contain terms such as "linuxophobe" or "EnterpriseLinux" (as it happens, the Enron E-Mail Dataset does not actually contain any such terms), whereas the MATCH query on the FTS3 table selects only those rows that contain "linux" as a discrete token. Both searches are case-insensitive.
The speed of the MATCH query comes at a cost in storage and load time. The FTS3 table consumes around 2006 MB on disk compared to just 1453 MB for the ordinary table. Using the same hardware configuration used to perform the SELECT queries above, the FTS3 table took just under 31 minutes to populate, versus 25 minutes for the ordinary table.
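The difference between token matching and substring matching can be reproduced on a tiny dataset. The following is a minimal sketch using Python's standard sqlite3 module; it assumes the underlying SQLite library was built with FTS3 enabled, and the table names docs_fts and docs_plain and the sample rows are invented for illustration (they are not part of the Enron dataset).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table names; one FTS3 table and one ordinary table.
conn.execute("CREATE VIRTUAL TABLE docs_fts USING fts3(content)")
conn.execute("CREATE TABLE docs_plain(content TEXT)")

# Invented sample rows: only the first contains "linux" as a whole word.
samples = [
    "Installing Linux on a laptop",
    "A confirmed linuxophobe",
    "EnterpriseLinux release notes",
    "Nothing relevant here",
]
for text in samples:
    conn.execute("INSERT INTO docs_fts(content) VALUES (?)", (text,))
    conn.execute("INSERT INTO docs_plain(content) VALUES (?)", (text,))

# MATCH selects rows containing "linux" as a discrete token.
match_count = conn.execute(
    "SELECT count(*) FROM docs_fts WHERE content MATCH 'linux'"
).fetchone()[0]

# LIKE selects rows containing "linux" anywhere, including inside longer words.
like_count = conn.execute(
    "SELECT count(*) FROM docs_plain WHERE content LIKE '%linux%'"
).fetchone()[0]

print(match_count, like_count)
```

On this sample data the MATCH query should count only the first row, while the LIKE query should also count the "linuxophobe" and "EnterpriseLinux" rows.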
Like other virtual table types, new FTS3 tables are created using a CREATE VIRTUAL TABLE statement. The module name, which follows the USING keyword, is "fts3". The virtual table module arguments may be left empty, in which case an FTS3 table with a single user-defined column named "content" is created. Alternatively, a list of comma-separated column names may be passed as the module arguments.
If column names are explicitly provided for the FTS3 table as part of the CREATE VIRTUAL TABLE statement, then a datatype name may be optionally specified for each column. However, this is pure syntactic sugar; the supplied typenames are not used by FTS3 or the SQLite core for any purpose. The same applies to any constraints specified along with an FTS3 column name: they are parsed but not used or recorded by the system in any way.
-- Create an FTS3 table named "data" with one column - "content":
CREATE VIRTUAL TABLE data USING fts3();
-- Create an FTS3 table named "pages" with three columns:
CREATE VIRTUAL TABLE pages USING fts3(title, keywords, body);
-- Create an FTS3 table named "mail" with two columns. Datatypes
-- and column constraints are specified along with each column. These
-- are completely ignored by FTS3 and SQLite.
CREATE VIRTUAL TABLE mail USING fts3(
  subject VARCHAR(256) NOT NULL,
  body TEXT CHECK(length(body)<10240)
);
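To see that the declared types and constraints really are ignored, the "mail" table definition can be exercised with values that would violate them in an ordinary table. This is a small sketch using Python's sqlite3 module, assuming an FTS3-enabled SQLite build; the inserted values are invented for the test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE VIRTUAL TABLE mail USING fts3(
      subject VARCHAR(256) NOT NULL,
      body TEXT CHECK(length(body)<10240)
    )
""")

# A NULL subject and a 20000-character body would both be rejected if the
# NOT NULL and CHECK constraints were enforced.
long_body = "word " * 4000
conn.execute("INSERT INTO mail(subject, body) VALUES (?, ?)", (None, long_body))

subject, body_length = conn.execute(
    "SELECT subject, length(body) FROM mail"
).fetchone()
print(subject, body_length)
```

Because the constraints are parsed but not recorded, the INSERT is expected to succeed and the row to be stored unchanged.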
As well as a list of columns, the module arguments passed to a CREATE VIRTUAL TABLE statement used to create an FTS3 table may be used to specify a tokenizer. This is done by specifying a string of the form "tokenize=<tokenizer name> <tokenizer args>" in place of a column name, where <tokenizer name> is the name of the tokenizer to use and <tokenizer args> is an optional list of whitespace-separated qualifiers to pass to the tokenizer implementation. A tokenizer specification may be placed anywhere in the column list, but at most one tokenizer declaration is allowed for each CREATE VIRTUAL TABLE statement. The second and subsequent tokenizer declarations are interpreted as column names. See below for a detailed description of using (and, if necessary, implementing) a tokenizer.
-- Create an FTS3 table named "papers" with two columns that uses
-- the tokenizer "porter".
CREATE VIRTUAL TABLE papers USING fts3(author, document, tokenize=porter);
-- Create an FTS3 table with a single column - "content" - that uses
-- the "simple" tokenizer.
CREATE VIRTUAL TABLE data USING fts3(tokenize=simple);
-- Create an FTS3 table with two columns that uses the "icu" tokenizer.
-- The qualifier "en_AU" is passed to the tokenizer implementation
CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU);
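The choice of tokenizer changes which queries match. As a rough illustration, the sketch below builds two otherwise identical tables, one with the "porter" tokenizer and one with "simple", and runs the same query against both. It assumes an FTS3-enabled SQLite build behind Python's sqlite3 module; the table names papers_porter and papers_simple and the sample sentence are invented, and the "icu" tokenizer is left out because it is only present in builds compiled with ICU support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical tables that differ only in their tokenizer.
conn.execute("CREATE VIRTUAL TABLE papers_porter USING fts3(content, tokenize=porter)")
conn.execute("CREATE VIRTUAL TABLE papers_simple USING fts3(content, tokenize=simple)")

for table in ("papers_porter", "papers_simple"):
    conn.execute(f"INSERT INTO {table}(content) VALUES ('Running the test suite')")


def count_hits(table, term):
    """Return the number of rows in `table` whose content matches `term`."""
    sql = f"SELECT count(*) FROM {table} WHERE content MATCH ?"
    return conn.execute(sql, (term,)).fetchone()[0]


# The porter tokenizer stems "Running" and "run" to the same token;
# the simple tokenizer only lower-cases, so the tokens differ.
porter_hits = count_hits("papers_porter", "run")
simple_hits = count_hits("papers_simple", "run")
print(porter_hits, simple_hits)
```

With stemming, a search for "run" is expected to find the row containing "Running"; without it, the same search finds nothing.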
FTS3 tables may be dropped from the database using an ordinary DROP TABLE statement. For example:
-- Create, then immediately drop, an FTS3 table.
CREATE VIRTUAL TABLE data USING fts3();
DROP TABLE data;
FTS3 tables are populated using INSERT, UPDATE and DELETE statements in the same way as ordinary SQLite tables are.
As well as the columns named by the user (or the "content" column if no module arguments were specified as part of the CREATE VIRTUAL TABLE statement), each FTS3 table has a "rowid" column. The rowid of an FTS3 table behaves in the same way as the rowid column of an ordinary SQLite table, except that the values stored in the rowid column of an FTS3 table remain unchanged if the database is rebuilt using the VACUUM command. For FTS3 tables, "docid" is allowed as an alias along with the usual "rowid", "oid" and "_oid_" identifiers. Attempting to insert or update a row with a docid value that already exists in the table is an error, just as it would be with an ordinary SQLite table.
There is one other subtle difference between "docid" and the normal SQLite aliases for the rowid column. Normally, if an INSERT or UPDATE statement assigns discrete values to two or more aliases of the rowid column, SQLite writes the rightmost of such values specified in the INSERT or UPDATE statement to the database. However, assigning a non-NULL value to both the "docid" and one or more of the SQLite rowid aliases when inserting or updating an FTS3 table is considered an error. See below for an example.
-- Create an FTS3 table
CREATE VIRTUAL TABLE pages USING fts3(title, body);
-- Insert a row with a specific docid value.
INSERT INTO pages(docid, title, body) VALUES(53, 'Home Page', 'SQLite is a software...');
-- Insert a row and allow FTS3 to assign a docid value using the same algorithm as
-- SQLite uses for ordinary tables. In this case the new docid will be 54,
-- one greater than the largest docid currently present in the table.
INSERT INTO pages(title, body) VALUES('Download', 'All SQLite source code...');
-- Change the title of the row just inserted.
UPDATE pages SET title = 'Download SQLite' WHERE rowid = 54;
-- Delete the entire table contents.
DELETE FROM pages;
-- The following is an error. It is not possible to assign non-NULL values to both
-- the rowid and docid columns of an FTS3 table.
INSERT INTO pages(rowid, docid, title, body) VALUES(1, 2, 'A title', 'A document body');
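The docid rules can also be checked from application code. The sketch below replays part of the SQL above through Python's sqlite3 module, assuming an FTS3-enabled SQLite build. It catches the generic sqlite3.Error base class because the exact exception subclass raised for each failure is not something this document specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts3(title, body)")

# Explicit docid, then an automatically assigned one.
conn.execute(
    "INSERT INTO pages(docid, title, body) "
    "VALUES(53, 'Home Page', 'SQLite is a software...')"
)
conn.execute(
    "INSERT INTO pages(title, body) VALUES('Download', 'All SQLite source code...')"
)

# "docid" and "rowid" name the same value.
ids = conn.execute("SELECT docid, rowid FROM pages ORDER BY docid").fetchall()


def is_rejected(sql):
    """Return True if executing `sql` raises an SQLite error."""
    try:
        conn.execute(sql)
    except sqlite3.Error:
        return True
    return False


# Re-using an existing docid is an error.
duplicate_rejected = is_rejected(
    "INSERT INTO pages(docid, title, body) VALUES(53, 'Again', 'Duplicate docid')"
)

# Assigning non-NULL values to both rowid and docid is an error.
conflict_rejected = is_rejected(
    "INSERT INTO pages(rowid, docid, title, body) "
    "VALUES(1, 2, 'A title', 'A document body')"
)
print(ids, duplicate_rejected, conflict_rejected)
```

Both failing statements should leave the table with only the two rows inserted at the start.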
To support full-text queries, FTS3 maintains an inverted index that maps from each unique term or word that appears in the dataset to the locations in which it appears within the table contents. For the curious, a complete description of the data structure used to store this index within the database file is given below. A feature of this data structure is that at any time the database may contain not one index b-tree, but several different b-trees that are incrementally merged as rows are inserted, updated and deleted. This technique improves performance when writing to an FTS3 table, but causes some overhead for full-text queries that use the index. Executing an SQL statement of the form "SELECT optimize(<fts3-table>) FROM <fts3-table>" causes FTS3 to merge all existing index b-trees into a single large b-tree containing the entire index. This can be an expensive operation, but may speed up subsequent queries.
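A minimal sketch of invoking the optimize statement from Python's sqlite3 module follows, assuming an FTS3-enabled SQLite build. The table name docs and the sample rows are invented. Each row is committed separately on the assumption that this leaves several index b-trees to merge; LIMIT 1 is added so that the function is evaluated for a single row only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts3(content)")

# Commit after every row so that the index is written in several steps.
for text in ("first linux note", "second note", "third linux note"):
    conn.execute("INSERT INTO docs(content) VALUES (?)", (text,))
    conn.commit()

query = "SELECT docid FROM docs WHERE content MATCH 'linux' ORDER BY docid"
hits_before = [row[0] for row in conn.execute(query)]

# Merge all index b-trees into one; the function returns a status string.
status = conn.execute("SELECT optimize(docs) FROM docs LIMIT 1").fetchall()[0][0]
conn.commit()

hits_after = [row[0] for row in conn.execute(query)]
print(status, hits_before, hits_after)
```

Optimizing changes only how the index is stored, so the query results before and after should be identical.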
|
|
FTS3 is an SQLite virtual table module that allows users to perform full-text searches on a set of documents. The most common (and effective) way to describe full-text searches is "what Google, Yahoo and Altavista do with documents placed on the World Wide Web". Users input a term, or series of terms, perhaps connected by a binary operator or grouped together into a phrase, and the full-text query system finds the set of documents that best matches those terms considering the operators and groupings the user has specified. This document describes the deployment and usage of FTS3.
Portions of the original FTS3 code were contributed to the SQLite project by Scott Hess of Google. It is now developed and maintained as part of SQLite.
The FTS3 extension module allows users to create special tables with a built-in full-text index (hereafter "FTS3 tables"). The full-text index allows the user to efficiently query the database for all rows that contain one or more instances specified word (hereafter a "token", even if the table contains many large documents.
For example, if each of the 517430 documents in the "Enron E-Mail Dataset" is inserted into both the FTS3 table and the ordinary SQLite table created using the following SQL script:
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT); /* FTS3 table */ CREATE TABLE enrondata2(content TEXT); /* Ordinary table */ |
Then either of the two queries below may be executed to find the number of documents in the database that contain the word "linux" (351). Using one desktop PC hardware configuration, the query on the FTS3 table returns in approximately 0.03 seconds, versus 22.5 for querying the ordinary table.
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux'; /* 0.03 seconds */ SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */ |
Of course, the two queries above are not entirely equivalent. For example the LIKE query matches rows that contain terms such as "linuxophobe" or "EnterpriseLinux" (as it happens, the Enron E-Mail Dataset does not actually contain any such terms), whereas the MATCH query on the FTS3 table selects only those rows that contain "linux" as a discrete token. Both searches are case-insensitive. The FTS3 table consumes around 2006 MB on disk compared to just 1453 MB for the ordinary table. Using the same hardware configuration used to perform the SELECT queries above, the FTS3 table took just under 31 minutes to populate, versus 25 for the ordinary table.
Like other virtual table types, new FTS3 tables are created using a CREATE VIRTUAL TABLE statement. The module name, which follows the USING keyword, is "fts3". The virtual table module arguments may be left empty, in which case an FTS3 table with a single user-defined column named "content" is created. Alternatively, the module arguments may be passed a list of comma separated column names.
If column names are explicitly provided for the FTS3 table as part of the CREATE VIRTUAL TABLE statement, then a datatype name may be optionally specified for each column. However, this is pure syntactic sugar, the supplied typenames are not used by FTS3 or the SQLite core for any purpose. The same applies to any constraints specified along with an FTS3 column name - they are parsed but not used or recorded by the system in any way.
-- Create an FTS3 table named "data" with one column - "content": CREATE VIRTUAL TABLE data USING fts3(); -- Create an FTS3 table named "pages" with three columns: CREATE VIRTUAL TABLE pages USING fts3(title, keywords, body); -- Create an FTS3 table named "mail" with two columns. Datatypes -- and column constraints are specified along with each column. These -- are completely ignored by FTS3 and SQLite. CREATE VIRTUAL TABLE mail USING fts3( subject VARCHAR(256) NOT NULL, body TEXT CHECK(length(body)<10240) ); |
As well as a list of columns, the module arguments passed to a CREATE VIRTUAL TABLE statement used to create an FTS3 table may be used to specify a tokenizer. This is done by specifying a string of the form "tokenize=<tokenizer name> <tokenizer args<" in place of a column name, where <tokenizer name> is the name of the tokenizer to use and <tokenizer args> is an optional list of whitespace separated qualifiers to pass to the tokenizer implementation. A tokenizer specification may be placed anywhere in the column list, but at most one tokenizer declaration is allowed for each CREATE VIRTUAL TABLE statement. The second and subsequent tokenizer declaration are interpreted as column names. See below for a detailed description of using (and, if necessary, implementing) a tokenizer.
-- Create an FTS3 table named "papers" with two columns that uses -- the tokenizer "porter". CREATE VIRTUAL TABLE papers USING fts3(author, document, tokenize=porter); -- Create an FTS3 table with a single column - "content" - that uses -- the "simple" tokenizer. CREATE VIRTUAL TABLE data USING fts3(tokenize=simple); -- Create an FTS3 table with two columns that uses the "icu" tokenizer. -- The qualifier "en_AU" is passed to the tokenizer implementation CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU); |
FTS3 tables may be dropped from the database using an ordinary DROP TABLE statement. For example:
-- Create, then immediately drop, an FTS3 table. CREATE VIRTUAL TABLE data USING fts3(); DROP TABLE data; |
FTS3 tables are populated using INSERT, UPDATE and DELETE statements in the same way as ordinary SQLite tables are.
As well as the columns named by the user (or the "content" column if no module arguments where specified as part of the CREATE VIRTUAL TABLE statement), each FTS3 table has a "rowid" column. The rowid of an FTS3 table behaves in the same way as the rowid column of an ordinary SQLite table, except that the values stored in the rowid column of an FTS3 table remain unchanged if the database is rebuilt using the VACUUM command. For FTS3 tables, "docid" is allowed as an alias along with the usual "rowid", "oid" and "_oid_" identifiers. Attempting to insert or update a row with a docid value that already exists in the table is an error, just as it would be with an ordinary SQLite table.
There is one other subtle difference between "docid" and the normal SQLite aliases for the rowid column. Normally, if an INSERT or UPDATE statement assigns discreet values to two or more aliases of the rowid column, SQLite writes the rightmost of such values specified in the INSERT or UPDATE statement to the database. However, assigning a non-NULL value to both the "docid" and one or more of the SQLite rowid aliases when inserting or updating an FTS3 table is considered an error. See below for an example.
-- Create an FTS3 table
CREATE VIRTUAL TABLE pages USING fts3(title, body);
-- Insert a row with a specific docid value.
INSERT INTO pages(docid, title, body) VALUES(53, 'Home Page', 'SQLite is a software...');
-- Insert a row and allow FTS3 to assign a docid value using the same algorithm as
-- SQLite uses for ordinary tables. In this case the new docid will be 54,
-- one greater than the largest docid currently present in the table.
INSERT INTO pages(title, body) VALUES('Download', 'All SQLite source code...');
-- Change the title of the row just inserted.
UPDATE pages SET title = 'Download SQLite' WHERE rowid = 54;
-- Delete the entire table contents.
DELETE FROM pages;
-- The following is an error. It is not possible to assign non-NULL values to both
-- the rowid and docid columns of an FTS3 table.
INSERT INTO pages(rowid, docid, title, body) VALUES(1, 2, 'A title', 'A document body');
|
To support full-text queries, FTS3 maintains an inverted index that maps from each unique term or word that appears in the dataset to the locations in which it appears within the table contents. For the curious, a complete description of the data structure used to store this index within the database file is described below. A feature of this data structure is that at any time the database may contain not one index b-tree, but several different b-trees that are incrementally merged as rows are inserted, updated and deleted. This technique improves performance when writing to an FTS3 table, but causes some overhead for full-text queries that use the index. Executing an SQL statement of the form "SELECT optimize(<fts3-table>) FROM <fts3-table>" causes FTS3 to merge all existing index b-trees into a single large b-tree containing the entire index. This can be an expensive operation, but may spe:#EE1 0 ustar root root
|
|
FTS3 is an SQLite virtual table module that allows users to perform full-text searches on a set of documents. The most common (and effective) way to describe full-text searches is "what Google, Yahoo and Altavista do with documents placed on the World Wide Web". Users input a term, or series of terms, perhaps connected by a binary operator or grouped together into a phrase, and the full-text query system finds the set of documents that best matches those terms considering the operators and groupings the user has specified. This document describes the deployment and usage of FTS3.
Portions of the original FTS3 code were contributed to the SQLite project by Scott Hess of Google. It is now developed and maintained as part of SQLite.
The FTS3 extension module allows users to create special tables with a built-in full-text index (hereafter "FTS3 tables"). The full-text index allows the user to efficiently query the database for all rows that contain one or more instances specified word (hereafter a "token", even if the table contains many large documents.
For example, if each of the 517430 documents in the "Enron E-Mail Dataset" is inserted into both the FTS3 table and the ordinary SQLite table created using the following SQL script:
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT); /* FTS3 table */ CREATE TABLE enrondata2(content TEXT); /* Ordinary table */ |
Then either of the two queries below may be executed to find the number of documents in the database that contain the word "linux" (351). Using one desktop PC hardware configuration, the query on the FTS3 table returns in approximately 0.03 seconds, versus 22.5 for querying the ordinary table.
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux'; /* 0.03 seconds */ SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */ |
Of course, the two queries above are not entirely equivalent. For example the LIKE query matches rows that contain terms such as "linuxophobe" or "EnterpriseLinux" (as it happens, the Enron E-Mail Dataset does not actually contain any such terms), whereas the MATCH query on the FTS3 table selects only those rows that contain "linux" as a discrete token. Both searches are case-insensitive. The FTS3 table consumes around 2006 MB on disk compared to just 1453 MB for the ordinary table. Using the same hardware configuration used to perform the SELECT queries above, the FTS3 table took just under 31 minutes to populate, versus 25 for the ordinary table.
Like other virtual table types, new FTS3 tables are created using a CREATE VIRTUAL TABLE statement. The module name, which follows the USING keyword, is "fts3". The virtual table module arguments may be left empty, in which case an FTS3 table with a single user-defined column named "content" is created. Alternatively, the module arguments may be passed a list of comma separated column names.
If column names are explicitly provided for the FTS3 table as part of the CREATE VIRTUAL TABLE statement, then a datatype name may be optionally specified for each column. However, this is pure syntactic sugar, the supplied typenames are not used by FTS3 or the SQLite core for any purpose. The same applies to any constraints specified along with an FTS3 column name - they are parsed but not used or recorded by the system in any way.
-- Create an FTS3 table named "data" with one column - "content": CREATE VIRTUAL TABLE data USING fts3(); -- Create an FTS3 table named "pages" with three columns: CREATE VIRTUAL TABLE pages USING fts3(title, keywords, body); -- Create an FTS3 table named "mail" with two columns. Datatypes -- and column constraints are specified along with each column. These -- are completely ignored by FTS3 and SQLite. CREATE VIRTUAL TABLE mail USING fts3( subject VARCHAR(256) NOT NULL, body TEXT CHECK(length(body)<10240) ); |
As well as a list of columns, the module arguments passed to a CREATE VIRTUAL TABLE statement used to create an FTS3 table may be used to specify a tokenizer. This is done by specifying a string of the form "tokenize=<tokenizer name> <tokenizer args<" in place of a column name, where <tokenizer name> is the name of the tokenizer to use and <tokenizer args> is an optional list of whitespace separated qualifiers to pass to the tokenizer implementation. A tokenizer specification may be placed anywhere in the column list, but at most one tokenizer declaration is allowed for each CREATE VIRTUAL TABLE statement. The second and subsequent tokenizer declaration are interpreted as column names. See below for a detailed description of using (and, if necessary, implementing) a tokenizer.
-- Create an FTS3 table named "papers" with two columns that uses -- the tokenizer "porter". CREATE VIRTUAL TABLE papers USING fts3(author, document, tokenize=porter); -- Create an FTS3 table with a single column - "content" - that uses -- the "simple" tokenizer. CREATE VIRTUAL TABLE data USING fts3(tokenize=simple); -- Create an FTS3 table with two columns that uses the "icu" tokenizer. -- The qualifier "en_AU" is passed to the tokenizer implementation CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU); |
FTS3 tables may be dropped from the database using an ordinary DROP TABLE statement. For example:
-- Create, then immediately drop, an FTS3 table. CREATE VIRTUAL TABLE data USING fts3(); DROP TABLE data; |
FTS3 tables are populated using INSERT, UPDATE and DELETE statements in the same way as ordinary SQLite tables are.
As well as the columns named by the user (or the "content" column if no module arguments where specified as part of the CREATE VIRTUAL TABLE statement), each FTS3 table has a "rowid" column. The rowid of an FTS3 table behaves in the same way as the rowid column of an ordinary SQLite table, except that the values stored in the rowid column of an FTS3 table remain unchanged if the database is rebuilt using the VACUUM command. For FTS3 tables, "docid" is allowed as an alias along with the usual "rowid", "oid" and "_oid_" identifiers. Attempting to insert or update a row with a docid value that already exists in the table is an error, just as it would be with an ordinary SQLite table.
There is one other subtle difference between "docid" and the normal SQLite aliases for the rowid column. Normally, if an INSERT or UPDATE statement assigns discreet values to two or more aliases of the rowid column, SQLite writes the rightmost of such values specified in the INSERT or UPDATE statement to the database. However, assigning a non-NULL value to both the "docid" and one or more of the SQLite rowid aliases when inserting or updating an FTS3 table is considered an error. See below for an example.
-- Create an FTS3 table
CREATE VIRTUAL TABLE pages USING fts3(title, body);
-- Insert a row with a specific docid value.
INSERT INTO pages(docid, title, body) VALUES(53, 'Home Page', 'SQLite is a software...');
-- Insert a row and allow FTS3 to assign a docid value using the same algorithm as
-- SQLite uses for ordinary tables. In this case the new docid will be 54,
-- one greater than the largest docid currently present in the table.
INSERT INTO pages(title, body) VALUES('Download', 'All SQLite source code...');
-- Change the title of the row just inserted.
UPDATE pages SET title = 'Download SQLite' WHERE rowid = 54;
-- Delete the entire table contents.
DELETE FROM pages;
-- The following is an error. It is not possible to assign non-NULL values to both
-- the rowid and docid columns of an FTS3 table.
INSERT INTO pages(rowid, docid, title, body) VALUES(1, 2, 'A title', 'A document body');
|
To support full-text queries, FTS3 maintains an inverted index that maps from each unique term or word that appears in the dataset to the locations in which it appears within the table contents. For the curious, a complete description of the data structure used to store this index within the database file is described below. A feature of this data structure is that at any time the database may contain not one index b-tree, but several different b-trees that are incrementally merged as rows are inserted, updated and deleted. This technique improves performance when writing to an FTS3 table, but causes some overhead for full-text queries that use the index. Executing an SQL statement of the form "SELECT optimize(<fts3-table>) FROM <fts3-table>" causes FTS3 to merge all existing index b-trees into a single large b-tree containing the entire index. This can be an expensive operation, but may spe:#EE1 0 ustar root root
|
|
FTS3 is an SQLite virtual table module that allows users to perform full-text searches on a set of documents. The most common (and effective) way to describe full-text searches is "what Google, Yahoo and Altavista do with documents placed on the World Wide Web". Users input a term, or series of terms, perhaps connected by a binary operator or grouped together into a phrase, and the full-text query system finds the set of documents that best matches those terms considering the operators and groupings the user has specified. This document describes the deployment and usage of FTS3.
Portions of the original FTS3 code were contributed to the SQLite project by Scott Hess of Google. It is now developed and maintained as part of SQLite.
The FTS3 extension module allows users to create special tables with a built-in full-text index (hereafter "FTS3 tables"). The full-text index allows the user to efficiently query the database for all rows that contain one or more instances specified word (hereafter a "token", even if the table contains many large documents.
For example, if each of the 517430 documents in the "Enron E-Mail Dataset" is inserted into both the FTS3 table and the ordinary SQLite table created using the following SQL script:
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT); /* FTS3 table */ CREATE TABLE enrondata2(content TEXT); /* Ordinary table */ |
Then either of the two queries below may be executed to find the number of documents in the database that contain the word "linux" (351). Using one desktop PC hardware configuration, the query on the FTS3 table returns in approximately 0.03 seconds, versus 22.5 for querying the ordinary table.
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux'; /* 0.03 seconds */ SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */ |
Of course, the two queries above are not entirely equivalent. For example the LIKE query matches rows that contain terms such as "linuxophobe" or "EnterpriseLinux" (as it happens, the Enron E-Mail Dataset does not actually contain any such terms), whereas the MATCH query on the FTS3 table selects only those rows that contain "linux" as a discrete token. Both searches are case-insensitive. The FTS3 table consumes around 2006 MB on disk compared to just 1453 MB for the ordinary table. Using the same hardware configuration used to perform the SELECT queries above, the FTS3 table took just under 31 minutes to populate, versus 25 for the ordinary table.
Like other virtual table types, new FTS3 tables are created using a CREATE VIRTUAL TABLE statement. The module name, which follows the USING keyword, is "fts3". The virtual table module arguments may be left empty, in which case an FTS3 table with a single user-defined column named "content" is created. Alternatively, the module arguments may be passed a list of comma separated column names.
If column names are explicitly provided for the FTS3 table as part of the CREATE VIRTUAL TABLE statement, then a datatype name may be optionally specified for each column. However, this is pure syntactic sugar, the supplied typenames are not used by FTS3 or the SQLite core for any purpose. The same applies to any constraints specified along with an FTS3 column name - they are parsed but not used or recorded by the system in any way.
-- Create an FTS3 table named "data" with one column - "content": CREATE VIRTUAL TABLE data USING fts3(); -- Create an FTS3 table named "pages" with three columns: CREATE VIRTUAL TABLE pages USING fts3(title, keywords, body); -- Create an FTS3 table named "mail" with two columns. Datatypes -- and column constraints are specified along with each column. These -- are completely ignored by FTS3 and SQLite. CREATE VIRTUAL TABLE mail USING fts3( subject VARCHAR(256) NOT NULL, body TEXT CHECK(length(body)<10240) ); |
As well as a list of columns, the module arguments passed to a CREATE VIRTUAL TABLE statement used to create an FTS3 table may be used to specify a tokenizer. This is done by specifying a string of the form "tokenize=<tokenizer name> <tokenizer args<" in place of a column name, where <tokenizer name> is the name of the tokenizer to use and <tokenizer args> is an optional list of whitespace separated qualifiers to pass to the tokenizer implementation. A tokenizer specification may be placed anywhere in the column list, but at most one tokenizer declaration is allowed for each CREATE VIRTUAL TABLE statement. The second and subsequent tokenizer declaration are interpreted as column names. See below for a detailed description of using (and, if necessary, implementing) a tokenizer.
-- Create an FTS3 table named "papers" with two columns that uses -- the tokenizer "porter". CREATE VIRTUAL TABLE papers USING fts3(author, document, tokenize=porter); -- Create an FTS3 table with a single column - "content" - that uses -- the "simple" tokenizer. CREATE VIRTUAL TABLE data USING fts3(tokenize=simple); -- Create an FTS3 table with two columns that uses the "icu" tokenizer. -- The qualifier "en_AU" is passed to the tokenizer implementation CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU); |
FTS3 tables may be dropped from the database using an ordinary DROP TABLE statement. For example:
-- Create, then immediately drop, an FTS3 table. CREATE VIRTUAL TABLE data USING fts3(); DROP TABLE data; |
FTS3 tables are populated using INSERT, UPDATE and DELETE statements in the same way as ordinary SQLite tables are.
As well as the columns named by the user (or the "content" column if no module arguments where specified as part of the CREATE VIRTUAL TABLE statement), each FTS3 table has a "rowid" column. The rowid of an FTS3 table behaves in the same way as the rowid column of an ordinary SQLite table, except that the values stored in the rowid column of an FTS3 table remain unchanged if the database is rebuilt using the VACUUM command. For FTS3 tables, "docid" is allowed as an alias along with the usual "rowid", "oid" and "_oid_" identifiers. Attempting to insert or update a row with a docid value that already exists in the table is an error, just as it would be with an ordinary SQLite table.
There is one other subtle difference between "docid" and the normal SQLite aliases for the rowid column. Normally, if an INSERT or UPDATE statement assigns discreet values to two or more aliases of the rowid column, SQLite writes the rightmost of such values specified in the INSERT or UPDATE statement to the database. However, assigning a non-NULL value to both the "docid" and one or more of the SQLite rowid aliases when inserting or updating an FTS3 table is considered an error. See below for an example.
-- Create an FTS3 table
CREATE VIRTUAL TABLE pages USING fts3(title, body);
-- Insert a row with a specific docid value.
INSERT INTO pages(docid, title, body) VALUES(53, 'Home Page', 'SQLite is a software...');
-- Insert a row and allow FTS3 to assign a docid value using the same algorithm as
-- SQLite uses for ordinary tables. In this case the new docid will be 54,
-- one greater than the largest docid currently present in the table.
INSERT INTO pages(title, body) VALUES('Download', 'All SQLite source code...');
-- Change the title of the row just inserted.
UPDATE pages SET title = 'Download SQLite' WHERE rowid = 54;
-- Delete the entire table contents.
DELETE FROM pages;
-- The following is an error. It is not possible to assign non-NULL values to both
-- the rowid and docid columns of an FTS3 table.
INSERT INTO pages(rowid, docid, title, body) VALUES(1, 2, 'A title', 'A document body');
|
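The docid rules described above can also be checked programmatically. The following is a minimal sketch, assuming Python's standard sqlite3 module is linked against an SQLite build with FTS3 enabled. The helper fails and the variable names are hypothetical. Because the exact exception subclass raised for each error is not stated in this document, the sketch catches the general sqlite3.Error.

```python
import sqlite3

# Assumption: the SQLite library behind Python's sqlite3 module has FTS3 enabled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts3(title, body)")

# One row with an explicit docid, one with an automatically assigned docid.
conn.execute(
    "INSERT INTO pages(docid, title, body) "
    "VALUES(53, 'Home Page', 'SQLite is a software...')"
)
conn.execute(
    "INSERT INTO pages(title, body) "
    "VALUES('Download', 'All SQLite source code...')"
)


def fails(sql):
    """Return True if executing `sql` raises an SQLite error."""
    try:
        conn.execute(sql)
    except sqlite3.Error:
        return True
    return False


# Re-using an existing docid is an error.
duplicate_rejected = fails(
    "INSERT INTO pages(docid, title, body) VALUES(53, 'Dup', 'x')"
)
# Assigning non-NULL values to both rowid and docid is an error.
both_rejected = fails(
    "INSERT INTO pages(rowid, docid, title, body) "
    "VALUES(1, 2, 'A title', 'A document body')"
)

# VACUUM cannot run inside a transaction, so commit first.
conn.commit()
conn.execute("VACUUM")

# "docid" and "rowid" name the same value, and VACUUM leaves it unchanged.
ids = conn.execute("SELECT docid, rowid FROM pages ORDER BY docid").fetchall()
print(ids, duplicate_rejected, both_rejected)
```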
To support full-text queries, FTS3 maintains an inverted index that maps from each unique term or word that appears in the dataset to the locations in which it appears within the table contents. For the curious, a complete description of the data structure used to store this index within the database file is given below. A feature of this data structure is that at any time the database may contain not one index b-tree, but several different b-trees that are incrementally merged as rows are inserted, updated and deleted. This technique improves performance when writing to an FTS3 table, but causes some overhead for full-text queries that use the index. Executing an SQL statement of the form "SELECT optimize(<fts3-table>) FROM <fts3-table>" causes FTS3 to merge all existing index b-trees into a single large b-tree containing the entire index. This can be an expensive operation, but may speed up future queries.
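The optimize statement can be issued like any other query. The following is a minimal sketch, assuming Python's standard sqlite3 module is linked against an SQLite build with FTS3 enabled. The sample rows and the LIMIT 1 clause are choices made for this illustration. Each insert is committed separately on the assumption that this gives the index a chance to accumulate more than one b-tree before it is merged. The sketch only checks that query results are the same before and after the merge; it does not measure speed.

```python
import sqlite3

# Assumption: the SQLite library behind Python's sqlite3 module has FTS3 enabled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts3(title, body)")

# Commit after every insert so that each write is its own transaction.
for i in range(20):
    conn.execute(
        "INSERT INTO pages(title, body) VALUES(?, ?)",
        (f"Page {i}", f"SQLite document number {i}"),
    )
    conn.commit()

query = "SELECT count(*) FROM pages WHERE body MATCH 'sqlite'"
before = conn.execute(query).fetchone()[0]

# Merge the index b-trees. LIMIT 1 evaluates optimize() for a single row only.
rows = conn.execute("SELECT optimize(pages) FROM pages LIMIT 1").fetchall()
status = rows[0][0]  # a short status message returned by FTS3
conn.commit()

after = conn.execute(query).fetchone()[0]
print(status, before, after)
```

Since the merge rewrites the index, it is probably best run after a bulk load or during a quiet period rather than after every write.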
Like other virtual table types, new FTS3 tables are created using a CREATE VIRTUAL TABLE statement. The module name, which follows the USING keyword, is "fts3". The virtual table module arguments may be left empty, in which case an FTS3 table with a single user-defined column named "content" is created. Alternatively, the module arguments may consist of a comma-separated list of column names.
If column names are explicitly provided for the FTS3 table as part of the CREATE VIRTUAL TABLE statement, then a datatype name may optionally be specified for each column. However, this is pure syntactic sugar: the supplied typenames are not used by FTS3 or the SQLite core for any purpose. The same applies to any constraints specified along with an FTS3 column name. They are parsed but not used or recorded by the system in any way.
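That typenames and constraints are ignored can be confirmed with a short experiment. The following is a minimal sketch, assuming Python's standard sqlite3 module is linked against an SQLite build with FTS3 enabled. The inserted row is made up for illustration. The "subject" column is declared NOT NULL, yet inserting a NULL subject is expected to succeed because FTS3 does not record the constraint.

```python
import sqlite3

# Assumption: the SQLite library behind Python's sqlite3 module has FTS3 enabled.
conn = sqlite3.connect(":memory:")

# The typenames and the NOT NULL constraint are parsed, then ignored.
conn.execute(
    "CREATE VIRTUAL TABLE mail USING fts3("
    "subject VARCHAR(256) NOT NULL, body TEXT)"
)

# An ordinary table would reject this row; the FTS3 table accepts it.
conn.execute(
    "INSERT INTO mail(subject, body) VALUES(NULL, 'no subject line here')"
)

cur = conn.execute("SELECT * FROM mail WHERE body MATCH 'subject'")
names = [d[0] for d in cur.description]  # user-visible column names
row = cur.fetchone()
print(names, row)
```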
-- Create an FTS3 table named "data" with one column - "content":
CREATE VIRTUAL TABLE data USING fts3();
-- Create an FTS3 table named "pages" with three columns:
CREATE VIRTUAL TABLE pages USING fts3(title, keywords, body);
-- Create an FTS3 table named "mail" with two columns. Datatypes
-- and column constraints are specified along with each column. These
-- are completely ignored by FTS3 and SQLite.
CREATE VIRTUAL TABLE mail USING fts3( subject |