MySQL – ngram Full-Text Parser
Table of content
- The ngram Full-Text Parser
- Configuring ngram Token Size
- Creating FULLTEXT Index Using ngram Parser
- ngram Parser Space Handling
- ngram Parser Stopword Handling
- ngram Parser Phrase Search
- ngram Parser Term Search
- ngram Parser Wildcard Search
- ngram Full-Text Parser Using a Client Program
”;
Usually in Full-Text searching, the built-in MySQL Full-Text parser considers the white spaces between words as delimiters. This determines where the words actually begin and end, to make the search simpler. However, this is only simple for languages that use spaces to separate words.
Several ideographic languages like Chinese, Japanese and Korean languages do not use word delimiters. To support full-text searches in languages like these, an ngram parser is used. This parser is supported by both InnoDB and MyISAM storage engines.
The ngram Full-Text Parser
An ngram is a continuous sequence of ”n” characters from a given sequence of text. The ngram parser divides a sequence of text into tokens as a contiguous sequence of n characters.
For example, consider the text ”Tutorial” and observe how it is tokenized by the ngram parser −
n=1: ''T'', ''u'', ''t'', ''o'', ''r'', ''i'', ''a'', ''l'' n=2: ''Tu'', ''ut'', ''to'' ''or'', ''ri'', ''ia'' ''al'' n=3: ''Tut'', ''uto'', ''tor'', ''ori'', ''ria'', ''ial'' n=4: ''Tuto'', ''utor'', ''tori'', ''oria'', ''rial'' n=5: ''Tutor'', ''utori'', ''toria'', ''orial'' n=6: ''Tutori'', ''utoria'', ''torial'' n=7: ''Tutoria'', ''utorial'' n=8: ''Tutorial''
The ngram full-text parser is a built-in server plugin. As with other built-in server plug-ins, it is automatically loaded when the server is started.
Configuring ngram Token Size
To change the token size, from its default size 2, use the ngram_token_size configuration option. The range of ngram values is from 1 to 10. But to increase the speed of search queries, use smallers token sizes; as smaller token sizes allow faster searches with smaller full-text search indexes.
Because ngram_token_size is a read-only variable, you can only set its value using two options:
Setting the –ngram_token_size in startup string:
mysqld --ngram_token_size=1
Setting ngram_token_size in configuration file ”my.cnf”:
[mysqld] ngram_token_size=1
Creating FULLTEXT Index Using ngram Parser
A FULLTEXT index can be created on columns of a table using the FULLTEXT keyword. This is used with CREATE TABLE, ALTER TABLE or CREATE INDEX SQL statements; you just have to specify ”WITH PARSER ngram”. Following is the syntax −
CREATE TABLE table_name ( column_name1 datatype, column_name2 datatype, column_name3 datatype, ... FULLTEXT (column_name(s)) WITH PARSER NGRAM ) ENGINE=INNODB CHARACTER SET UTF8mb4;
Example
In this example, we are creating a FULLTEXT index using the CREATE TABLE statement as follows −
CREATE TABLE blog ( ID INT AUTO_INCREMENT NOT NULL, TITLE VARCHAR(255), DESCRIPTION TEXT, FULLTEXT ( TITLE, DESCRIPTION ) WITH PARSER NGRAM, PRIMARY KEY(id) ) ENGINE=INNODB CHARACTER SET UTF8MB4; SET NAMES UTF8MB4;
Now, insert data (in any ideographic language) into this table created −
INSERT INTO BLOG VALUES (NULL, ''教程'', ''教程是对一个概念的冗长研究''), (NULL, ''文章'', ''文章是关于一个概念的基于事实的小信息'');
To check how the text is tokenized, execute the following statements −
SET GLOBAL innodb_ft_aux_table = "customers/blog"; SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position;
ngram Parser Space Handling
Any whitespace character is eliminated in the ngram parser when parsing. For instance, consider the following TEXT with token size 2 −
-
“ab cd” is parsed to “ab”, “cd”
-
“a bc” is parsed to “bc”
ngram Parser Stop word Handling
Apart from the whitespace character, MySQL has a stop word list consisting of various that are considered to be stopwords. If the parser encounters any word in the text present in the stopword list, the word is excluded from the index.
ngram Parser Phrase Search
Normal Phrase searches are converted to ngram phrase searches. For example, The search phrase “abc” is converted to “ab bc”, which returns documents containing “abc” and “ab bc”; and the search phrase “abc def” is converted to “ab bc de ef”, which returns documents containing “abc def” and “ab bc de ef”. A document that contains “abcdef” is not returned.
ngram Parser Term Search
For natural language mode search, the search term is converted to a union of ngram terms. For example, the string “abc” (assuming ngram_token_size=2) is converted to “ab bc”. Given two documents, one containing “ab” and the other containing “abc”, the search term “ab bc” matches both documents.
For boolean mode search, the search term is converted to an ngram phrase search. For example, the string ”abc” (assuming ngram_token_size=2) is converted to ””ab bc””. Given two documents, one containing ”ab” and the other containing ”abc”, the search phrase ””ab bc”” only matches the document containing ”abc”.
ngram Parser Wildcard Search
Because an ngram FULLTEXT index contains only ngrams, and does not contain information about the beginning of terms, wildcard searches may return unexpected results. The following behaviors apply to wildcard searches using ngram FULLTEXT search indexes:
-
If the prefix term of a wildcard search is shorter than ngram token size, the query returns all indexed rows that contain ngram tokens starting with the prefix term. For example, assuming ngram_token_size=2, a search on “a*” returns all rows starting with “a”.
-
If the prefix term of a wildcard search is longer than ngram token size, the prefix term is converted to an ngram phrase and the wildcard operator is ignored. For example, assuming ngram_token_size=2, an “abc*” wildcard search is converted to “ab bc”.
ngram Full-Text Parser Using a Client Program
We can also perform ngram full-text parser operation using the client program.
Syntax
To perform the ngram fulltext parser through a PHP programe, we need to execute the “Create” statement using the mysqli function query() as follows −
$sql = "CREATE TABLE blog (ID INT AUTO_INCREMENT NOT NULL, title VARCHAR(255), DESCRIPTION TEXT, FULLTEXT ( title, DESCRIPTION ) WITH PARSER NGRAM, PRIMARY KEY(id) )ENGINE=INNODB CHARACTER SET UTF8MB4"; $mysqli->query($sql);
To perform the ngram fulltext parser through a JavaScript program, we need to execute the “Create” statement using the query() function of mysql2 library as follows −
sql = `CREATE TABLE blog (ID INT AUTO_INCREMENT NOT NULL, title VARCHAR(255), DESCRIPTION TEXT, FULLTEXT ( title, DESCRIPTION ) WITH PARSER NGRAM, PRIMARY KEY(id) )ENGINE=INNODB CHARACTER SET UTF8MB4`; con.query(sql);
To perform the ngram fulltext parser through a Java program, we need to execute the “Create” statement using the JDBC function execute() as follows −
String sql = "CREATE TABLE blog (ID INT AUTO_INCREMENT NOT NULL, title VARCHAR(255), description TEXT," + " FULLTEXT ( title, description ) WITH PARSER NGRAM, PRIMARY KEY(id) )ENGINE=INNODB CHARACTER SET UTF8MB4"; statement.execute(sql);
To perform the ngram fulltext parser through a python program, we need to execute the “Create” statement using the execute() function of the MySQL Connector/Python as follows −
create_table_query = ''CREATE TABLE blog(ID INT AUTO_INCREMENT NOT NULL, title VARCHAR(255), description TEXT, FULLTEXT (title, description) WITH PARSER NGRAM, PRIMARY KEY(id)) ENGINE=INNODB CHARACTER SET UTF8MB4'' cursorObj.execute(queryexpansionfulltext_search)
Example
Following are the programs −
$dbhost = "localhost"; $dbuser = "root"; $dbpass = "password"; $dbname = "TUTORIALS"; $mysqli = new mysqli($dbhost, $dbuser, $dbpass, $dbname); if ($mysqli->connect_errno) { printf("Connect failed: %s
", $mysqli->connect_error); exit(); } // printf(''Connected successfully.
''); /*CREATE Table*/ $sql = "CREATE TABLE blog (ID INT AUTO_INCREMENT NOT NULL, title VARCHAR(255), description TEXT, FULLTEXT ( title, description ) WITH PARSER NGRAM, PRIMARY KEY(id) )ENGINE=INNODB CHARACTER SET UTF8MB4"; $result = $mysqli->query($sql); if ($result) { printf("Table created successfully...!n"); } //insert data $q = "INSERT INTO blog (id, title, description) VALUES (NULL, ''教程'', ''教程是对一个概念的冗长研究''), (NULL, ''文章'', ''文章是关于一个概念的基于事实的小信息'')"; if ($res = $mysqli->query($q)) { printf("Data inserted successfully...!n"); } //we will use the below statement to see how the ngram tokenizes the data: $setglobal = "SET GLOBAL innodb_ft_aux_table = ''TUTORIALS/blog''"; if ($mysqli->query($setglobal)) { echo "global innodb_ft_aux_table set...!"; } $s = "SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position "; if ($r = $mysqli->query($s)) { print_r($r); } //display data (ngram parser phrase search); $query = "SELECT * FROM blog WHERE MATCH (title, description) AGAINST (''教程'')"; if ($r = $mysqli->query($query)) { printf("Table Records: n"); while ($row = $r->fetch_assoc()) { printf( "ID: %d, Title: %s, Descriptions: %s", $row["id"], $row["title"], $row["description"] ); printf("n"); } } else { printf("Failed"); } $mysqli->close();
Output
The output obtained is as shown below −
global innodb_ft_aux_table set...!mysqli_result Object ( [current_field] => 0 [field_count] => 6 [lengths] => [num_rows] => 62 [type] => 0 ) Table Records: ID: 1, Title: 教程, Descriptions: 教程是对一个概念的冗长研究 ID: 3, Title: 教程, Descriptions: 教程是对一个概念的冗长研究
var mysql = require("mysql2"); var con = mysql.createConnection({ host: "localhost", user: "root", password: "password", }); //Connecting to MySQL con.connect(function (err) { if (err) throw err; // console.log("Connected successfully...!"); // console.log("--------------------------"); sql = "USE TUTORIALS"; con.query(sql); //create a table... sql = `CREATE TABLE blog (ID INT AUTO_INCREMENT NOT NULL, title VARCHAR(255), description TEXT, FULLTEXT ( title, description ) WITH PARSER NGRAM, PRIMARY KEY(id) )ENGINE=INNODB CHARACTER SET UTF8MB4`; con.query(sql); //insert data sql = `INSERT INTO blog (id, title, description) VALUES (NULL, ''教程'', ''教程是对一个概念的冗长研究''), (NULL, ''文章'', ''文章是关于一个概念的基于事实的小信息'')`; con.query(sql); //we will use the below statement to see how the ngram tokenizes the data: sql = "SET GLOBAL innodb_ft_aux_table = ''TUTORIALS/blog''"; con.query(sql); //display the table details; sql = `SELECT * FROM blog WHERE MATCH (title, description) AGAINST (''教程'')`; con.query(sql, function (err, result) { if (err) throw err; console.log(result); }); });
Output
The output obtained is as shown below −
[ { id: 1, title: ''教程'', description: ''教程是对一个概念的冗长研究'' } ]
import java.sql.Connection; import java.sql.DriverManager; import java.sql.ResultSet; import java.sql.Statement; public class NgRamFSearch { public static void main(String[] args) { String url = "jdbc:mysql://localhost:3306/TUTORIALS"; String username = "root"; String password = "password"; try { Class.forName("com.mysql.cj.jdbc.Driver"); Connection connection = DriverManager.getConnection(url, username, password); Statement statement = connection.createStatement(); System.out.println("Connected successfully...!"); //creating a table that takes fulltext column with parser ngram...! String sql = "CREATE TABLE blog (id INT AUTO_INCREMENT NOT NULL, title VARCHAR(255), description TEXT," + " FULLTEXT ( title, description ) WITH PARSER NGRAM, PRIMARY KEY(id) )ENGINE=INNODB CHARACTER SET UTF8MB4"; statement.execute(sql); //System.out.println("Table created successfully...!"); //inserting data to the table String insert = "INSERT INTO blog (id, title, description) VALUES (NULL, ''教程'', ''教程是对一个概念的冗长研究'')," + " (NULL, ''文章'', ''文章是关于一个概念的基于事实的小信息'')"; statement.execute(insert); //System.out.println("Data inserted successfully...!"); //we will use the below statement to see how the ngram tokenizes the data: String set_global = "SET GLOBAL innodb_ft_aux_table = ''TUTORIALS/blog''"; statement.execute(set_global); ResultSet resultSet = statement.executeQuery("SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position "); System.out.println("Information schema order by Id...!"); while (resultSet.next()){ System.out.println(resultSet.getString(1)+" "+resultSet.getString(2)); } //displaying the data...! String query = "SELECT * FROM blog WHERE MATCH (title, description) AGAINST (''教程'')"; ResultSet resultSet1 = statement.executeQuery(query); System.out.println("table records:"); while (resultSet1.next()){ System.out.println(resultSet1.getString(1)+" "+resultSet1.getString(2)+ " "+resultSet1.getString(3)); } connection.close(); } catch (Exception e) { System.out.println(e); } } }
Output
The output obtained is as shown below −
Connected successfully...! Information schema order by Id...! 教程 2 教程 2 程是 2 是对 2 对一 2 一个 2 个概 2 概念 2 念的 2 的冗 2 冗长 2 长研 2 研究 2 文章 3 文章 3 章是 3 是关 3 关于 3 于一 3 一个 2 个概 2 概念 2 念的 2 的基 3 基于 3 于事 3 事实 3 实的 3 的小 3 小信 3 信息 3 table records: 1 教程 教程是对一个概念的冗长研究
import mysql.connector # Establishing the connection connection = mysql.connector.connect( host=''localhost'', user=''root'', password=''password'', database=''tut'' ) # Creating a cursor object cursorObj = connection.cursor() # Create the blog table with NGRAM full-text parser create_table_query = '''''' CREATE TABLE blog ( id INT AUTO_INCREMENT NOT NULL, title VARCHAR(255), description TEXT, FULLTEXT (title, description) WITH PARSER NGRAM, PRIMARY KEY(id) ) ENGINE=INNODB CHARACTER SET UTF8MB4; '''''' cursorObj.execute(create_table_query) print("Table ''blog'' is created successfully!") # Set the character set to UTF8MB4 set_charset_query = "SET NAMES UTF8MB4;" cursorObj.execute(set_charset_query) print("Character set is set to UTF8MB4.") # Insert data into the blog table data_to_insert = [ (''教程'', ''教程是对一个概念的冗长研究''), (''文章'', ''文章是关于一个概念的基于事实的小信息'') ] insert_query = "INSERT INTO blog (title, description) VALUES (%s, %s)" cursorObj.executemany(insert_query, data_to_insert) connection.commit() print("Data inserted into the ''blog'' table.") # Query the INNODB_FT_INDEX_CACHE table to get the full-text index information query_index_cache_query = "SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position;" cursorObj.execute(query_index_cache_query) results = cursorObj.fetchall() print("Results of INNODB_FT_INDEX_CACHE table:") for row in results: print(row) # Close the cursor and connection cursorObj.close() connection.close()
Output
The output obtained is as shown below −
Table ''blog'' is created successfully! Character set is set to UTF8MB4. Data inserted into the ''blog'' table. Results of INNODB_FT_INDEX_CACHE table:
”;