12.8 String Functions and Operators
Table 12.12 String Functions and Operators
String-valued functions return NULL if the length of the result would be greater than the value of the max_allowed_packet system variable. See Section 5.1.1, “Configuring the Server”. For functions that operate on string positions, the first position is numbered 1. For functions that take length arguments, noninteger arguments are rounded to the nearest integer.
12.8.1 String Comparison Functions and Operators
Table 12.13 String Comparison Functions and Operators
If a string function is given a binary string as an argument, the resulting string is also a binary string. A number converted to a string is treated as a binary string. This affects only comparisons.

Normally, if any expression in a string comparison is case-sensitive, the comparison is performed in case-sensitive fashion.

If a string function is invoked from within the mysql client, binary strings display using hexadecimal notation, depending on the value of the --binary-as-hex option. For more information about that option, see Section 4.5.1, “mysql — The MySQL Command-Line Client”.
12.8.2 Regular Expressions
Table 12.14 Regular Expression Functions and Operators
A regular expression is a powerful way of specifying a pattern for a complex search. This section discusses the functions and operators available for regular expression matching and illustrates, with examples, some of the special characters and constructs that can be used for regular expression operations. See also Section 3.3.4.7, “Pattern Matching”. MySQL implements regular expression support using International Components for Unicode (ICU), which provides full Unicode support and is multibyte safe. (Prior to MySQL 8.0.4, MySQL used Henry Spencer's implementation of regular expressions, which operates in byte-wise fashion and is not multibyte safe. For information about ways in which applications that use regular expressions may be affected by the implementation change, see Regular Expression Compatibility Considerations.) Prior to MySQL 8.0.22, it was possible to use binary string arguments with these functions, but they yielded inconsistent results. In MySQL 8.0.22 and later, use of a binary string with any of the MySQL regular expression functions is rejected with ER_CHARACTER_SET_MISMATCH.
Regular Expression Function and Operator Descriptions
Regular Expression Syntax

A regular expression describes a set of strings. The simplest regular expression is one that has no special characters in it. For example, the regular expression hello matches hello and nothing else.

Nontrivial regular expressions use certain special constructs so that they can match more than one string. For example, the regular expression hello|world contains the | alternation operator and matches either hello or world.

As a more complex example, the regular expression B[an]*s matches any of the strings Bananas, Baaaaas, Bs, and any other string starting with a B, ending with an s, and containing any number of a or n characters in between.

The following list covers some of the basic special characters and constructs that can be used in regular expressions. For information about the full regular expression syntax supported by the ICU library used to implement regular expression support, visit the International Components for Unicode web site.
To use a literal instance of a special character in a regular expression, precede it by two backslash (\) characters. The MySQL parser interprets one of the backslashes, and the regular expression library interprets the other. For example, to match the string 1+2 that contains the special + character, only the last of the following regular expressions is the correct one:

mysql> SELECT REGEXP_LIKE('1+2', '1+2');       -> 0
mysql> SELECT REGEXP_LIKE('1+2', '1\+2');      -> 0
mysql> SELECT REGEXP_LIKE('1+2', '1\\+2');     -> 1
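The two layers of backslash handling can be imitated outside MySQL. The following Python sketch is only an illustration: mysql_unescape() and regexp_like() are hypothetical helper names, the unescaping models just the generic rule that a backslash followed by a character yields that character, and Python's re module stands in for the ICU library, which it only approximates.

```python
import re


def mysql_unescape(literal: str) -> str:
    """Remove one level of backslashes, as a string-literal parser would.

    Assumption: only the generic "backslash + char -> char" rule is
    modeled; named escape sequences (newline, tab, ...) are not handled.
    """
    out = []
    i = 0
    while i < len(literal):
        ch = literal[i]
        if ch == "\\" and i + 1 < len(literal):
            out.append(literal[i + 1])  # drop the backslash, keep next char
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def regexp_like(subject: str, sql_pattern_literal: str) -> int:
    """Return 1 if the unescaped pattern is found anywhere in subject."""
    pattern = mysql_unescape(sql_pattern_literal)
    return 1 if re.search(pattern, subject) else 0


# Raw strings hold exactly the characters typed between the SQL quotes.
print(regexp_like("1+2", r"1+2"))     # + is a quantifier -> 0
print(regexp_like("1+2", r"1\+2"))    # parser eats the backslash -> 0
print(regexp_like("1+2", r"1\\+2"))   # regex layer sees \+ -> 1
```

With a single backslash, the pattern that reaches the regular expression layer is still 1+2, which is why the second form fails in the same way as the first.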
Regular Expression Resource Control

REGEXP_LIKE() and similar functions use resources that can be controlled by setting system variables:
Regular Expression Compatibility Considerations

Prior to MySQL 8.0.4, MySQL used the Henry Spencer regular expression library to support regular expression operations, rather than International Components for Unicode (ICU). The following discussion describes differences between the Spencer and ICU libraries that may affect applications:
12.8.3 Character Set and Collation of Function Results

MySQL has many operators and functions that return a string. This section answers the question: What is the character set and collation of such a string?

For simple functions that take string input and return a string result as output, the output's character set and collation are the same as those of the principal input value. For example, UPPER(X) returns a string with the same character set and collation as X. The same applies for INSTR(), LCASE(), LOWER(), LTRIM(), MID(), REPEAT(), REPLACE(), REVERSE(), RIGHT(), RPAD(), RTRIM(), SOUNDEX(), SUBSTRING(), TRIM(), UCASE(), and UPPER().
Note
The REPLACE() function, unlike all other functions, always ignores the collation of the string input and performs a case-sensitive comparison.

If a string input or function result is a binary string, the string has the binary character set and collation. This can be checked by using the CHARSET() and COLLATION() functions, both of which return binary for a binary string argument:

mysql> SELECT CHARSET(BINARY 'a'), COLLATION(BINARY 'a');
+---------------------+-----------------------+
| CHARSET(BINARY 'a') | COLLATION(BINARY 'a') |
+---------------------+-----------------------+
| binary              | binary                |
+---------------------+-----------------------+

For operations that combine multiple string inputs and return a single string output, the “aggregation rules” of standard SQL apply for determining the collation of the result:
For example, with CASE ... WHEN a THEN b WHEN b THEN c COLLATE X END, the resulting collation is X. The same applies for UNION, ||, CONCAT(), ELT(), GREATEST(), IF(), and LEAST().

For operations that convert to character data, the character set and collation of the strings that result from the operations are defined by the character_set_connection and collation_connection system variables that determine the default connection character set and collation (see Section 10.4, “Connection Character Sets and Collations”). This applies only to BIN_TO_UUID(), CAST(), CONV(), FORMAT(), HEX(), and SPACE().

An exception to the preceding principle occurs for expressions for virtual generated columns. In such expressions, the table character set is used for BIN_TO_UUID(), CONV(), or HEX() results, regardless of connection character set.

If there is any question about the character set or collation of the result returned by a string function, use the CHARSET() or COLLATION() function to find out:

mysql> SELECT USER(), CHARSET(USER()), COLLATION(USER());
+----------------+-----------------+-------------------+
| USER()         | CHARSET(USER()) | COLLATION(USER()) |
+----------------+-----------------+-------------------+
| test@localhost | utf8            | utf8_general_ci   |
+----------------+-----------------+-------------------+

mysql> SELECT CHARSET(COMPRESS('abc')), COLLATION(COMPRESS('abc'));
+--------------------------+----------------------------+
| CHARSET(COMPRESS('abc')) | COLLATION(COMPRESS('abc')) |
+--------------------------+----------------------------+
| binary                   | binary                     |
+--------------------------+----------------------------+
12.9 What Calendar Is Used By MySQL?

MySQL uses what is known as a proleptic Gregorian calendar.

Every country that has switched from the Julian to the Gregorian calendar has had to discard at least ten days during the switch. To see how this works, consider the month of October 1582, when the first Julian-to-Gregorian switch occurred. There are no dates between October 4 and October 15. This discontinuity is called the cutover. Any dates before the cutover are Julian, and any dates following the cutover are Gregorian. Dates during a cutover are nonexistent.

A calendar applied to dates when it was not actually in use is called proleptic. Thus, if we assume there was never a cutover and Gregorian rules always rule, we have a proleptic Gregorian calendar. This is what is used by MySQL, as is required by standard SQL. For this reason, dates prior to the cutover stored as MySQL DATE or DATETIME values must be adjusted to compensate for the difference. It is important to realize that the cutover did not occur at the same time in all countries, and that the later it happened, the more days were lost. For example, in Great Britain, it took place in 1752, when Wednesday September 2 was followed by Thursday September 14. Russia remained on the Julian calendar until 1918, losing 13 days in the process, and what is popularly referred to as its “October Revolution” occurred in November according to the Gregorian calendar.
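The adjustment mentioned above can be sketched outside MySQL. Python's datetime.date also uses a proleptic Gregorian calendar, so the following illustration converts a Julian calendar date to the proleptic Gregorian date for the same day by going through a Julian Day Number. The function names are hypothetical, the conversion uses the commonly published integer formula, and nothing here is part of MySQL itself.

```python
from datetime import date


def julian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number for a date given in the Julian calendar."""
    a = (14 - month) // 12          # 1 for January/February, else 0
    y = year + 4800 - a
    m = month + 12 * a - 3          # March-based month index
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


# Difference between a Julian Day Number and Python's date ordinal
# (ordinal 1 is 0001-01-01 in the proleptic Gregorian calendar).
JDN_TO_ORDINAL = 1721425


def julian_to_proleptic_gregorian(year: int, month: int, day: int) -> date:
    """Proleptic Gregorian date that denotes the same day."""
    return date.fromordinal(julian_to_jdn(year, month, day) - JDN_TO_ORDINAL)


# Last Julian day before the 1582 cutover: the next day was October 15.
print(julian_to_proleptic_gregorian(1582, 10, 4))   # 1582-10-14
# Last Julian day in Great Britain: the next day was September 14.
print(julian_to_proleptic_gregorian(1752, 9, 2))    # 1752-09-13
# The "missing" days exist in a proleptic calendar.
print((date(1582, 10, 15) - date(1582, 10, 4)).days)  # 11
```

An application that stores historical Julian dates in DATE columns would apply a conversion of this kind before inserting the values.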
12.10 Full-Text Search Functions

MATCH (col1,col2,...) AGAINST (expr [search_modifier])

search_modifier:
  {
       IN NATURAL LANGUAGE MODE
     | IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
     | IN BOOLEAN MODE
     | WITH QUERY EXPANSION
  }

MySQL has support for full-text indexing and searching:
Full-text searching is performed using MATCH() AGAINST() syntax. MATCH() takes a comma-separated list that names the columns to be searched. AGAINST takes a string to search for, and an optional modifier that indicates what type of search to perform. The search string must be a string value that is constant during query evaluation. This rules out, for example, a table column because that can differ for each row. Previously, MySQL permitted the use of a rollup column with MATCH(), but queries employing this construct performed poorly and with unreliable results. (This is due to the fact that MATCH() is not implemented as a function of its arguments, but rather as a function of the row ID of the current row in the underlying scan of the base table.) As of MySQL 8.0.28, MySQL no longer allows such queries; more specifically, any query matching all of the criteria listed here is rejected with ER_FULLTEXT_WITH_ROLLUP:
Some examples of such queries are shown here:

# MATCH() in SELECT list...
SELECT MATCH (a) AGAINST ('abc') FROM t GROUP BY a WITH ROLLUP;
SELECT 1 FROM t GROUP BY a, MATCH (a) AGAINST ('abc') WITH ROLLUP;

# ...in HAVING clause...
SELECT 1 FROM t GROUP BY a WITH ROLLUP HAVING MATCH (a) AGAINST ('abc');

# ...and in ORDER BY clause
SELECT 1 FROM t GROUP BY a WITH ROLLUP ORDER BY MATCH (a) AGAINST ('abc');

The use of MATCH() with a rollup column in the WHERE clause is permitted.

There are three types of full-text searches:
For information about FULLTEXT query performance, see Section 8.3.5, “Column Indexes”. For more information about InnoDB FULLTEXT indexes, see Section 15.6.2.4, “InnoDB Full-Text Indexes”. Constraints on full-text searching are listed in Section 12.10.5, “Full-Text Restrictions”. The myisam_ftdump utility dumps the contents of a MyISAM full-text index. This may be helpful for debugging full-text queries. See Section 4.6.3, “myisam_ftdump — Display Full-Text Index information”.
12.10.1 Natural Language Full-Text Searches

By default or with the IN NATURAL LANGUAGE MODE modifier, the MATCH() function performs a natural language search for a string against a text collection. A collection is a set of one or more columns included in a FULLTEXT index. The search string is given as the argument to AGAINST(). For each row in the table, MATCH() returns a relevance value; that is, a similarity measure between the search string and the text in that row in the columns named in the MATCH() list.

mysql> CREATE TABLE articles (
         id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
         title VARCHAR(200),
         body TEXT,
         FULLTEXT (title,body)
       ) ENGINE=InnoDB;
Query OK, 0 rows affected (0.08 sec)

mysql> INSERT INTO articles (title,body) VALUES
         ('MySQL Tutorial','DBMS stands for DataBase ...'),
         ('How To Use MySQL Well','After you went through a ...'),
         ('Optimizing MySQL','In this tutorial, we show ...'),
         ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
         ('MySQL vs. YourSQL','In the following database comparison ...'),
         ('MySQL Security','When configured properly, MySQL ...');
Query OK, 6 rows affected (0.01 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM articles
       WHERE MATCH (title,body)
       AGAINST ('database' IN NATURAL LANGUAGE MODE);
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)

By default, the search is performed in case-insensitive fashion. To perform a case-sensitive full-text search, use a case-sensitive or binary collation for the indexed columns. For example, a column that uses the utf8mb4 character set can be assigned a collation of utf8mb4_0900_as_cs or utf8mb4_bin to make it case-sensitive for full-text searches.

When MATCH() is used in a WHERE clause, as in the example shown earlier, the rows returned are automatically sorted with the highest relevance first as long as the following conditions are met:
Given the conditions just listed, it is usually less effort to specify using ORDER BY an explicit sort order when one is necessary or desired. Relevance values are nonnegative floating-point numbers. Zero relevance means no similarity. Relevance is computed based on the number of words in the row (document), the number of unique words in the row, the total number of words in the collection, and the number of rows that contain a particular word.
Note
The term “document” may be used interchangeably with the term “row”, and both terms refer to the indexed part of the row. The term “collection” refers to the indexed columns and encompasses all rows.

To simply count matches, you could use a query like this:

mysql> SELECT COUNT(*) FROM articles
       WHERE MATCH (title,body)
       AGAINST ('database' IN NATURAL LANGUAGE MODE);
+----------+
| COUNT(*) |
+----------+
|        2 |
+----------+
1 row in set (0.00 sec)

You might find it quicker to rewrite the query as follows:

mysql> SELECT
       COUNT(IF(MATCH (title,body) AGAINST ('database' IN NATURAL LANGUAGE MODE), 1, NULL))
       AS count
       FROM articles;
+-------+
| count |
+-------+
|     2 |
+-------+
1 row in set (0.03 sec)

The first query does some extra work (sorting the results by relevance) but also can use an index lookup based on the WHERE clause. The index lookup might make the first query faster if the search matches few rows. The second query performs a full table scan, which might be faster than the index lookup if the search term was present in most rows.

For natural-language full-text searches, the columns named in the MATCH() function must be the same columns included in some FULLTEXT index in your table. For the preceding query, note that the columns named in the MATCH() function (title and body) are the same as those named in the definition of the articles table's FULLTEXT index. To search the title or body separately, you would create separate FULLTEXT indexes for each column.

You can also perform a boolean search or a search with query expansion. These search types are described in Section 12.10.2, “Boolean Full-Text Searches”, and Section 12.10.3, “Full-Text Searches with Query Expansion”.

A full-text search that uses an index can name columns only from a single table in the MATCH() clause because an index cannot span multiple tables.
For MyISAM tables, a boolean search can be done in the absence of an index (albeit more slowly), in which case it is possible to name columns from multiple tables.

The preceding example is a basic illustration that shows how to use the MATCH() function where rows are returned in order of decreasing relevance. The next example shows how to retrieve the relevance values explicitly. Returned rows are not ordered because the SELECT statement includes neither WHERE nor ORDER BY clauses:

mysql> SELECT id, MATCH (title,body)
       AGAINST ('Tutorial' IN NATURAL LANGUAGE MODE) AS score
       FROM articles;
+----+---------------------+
| id | score               |
+----+---------------------+
|  1 | 0.22764469683170319 |
|  2 |                   0 |
|  3 | 0.22764469683170319 |
|  4 |                   0 |
|  5 |                   0 |
|  6 |                   0 |
+----+---------------------+
6 rows in set (0.00 sec)

The following example is more complex. The query returns the relevance values and it also sorts the rows in order of decreasing relevance. To achieve this result, specify MATCH() twice: once in the SELECT list and once in the WHERE clause. This causes no additional overhead, because the MySQL optimizer notices that the two MATCH() calls are identical and invokes the full-text search code only once.

mysql> SELECT id, body, MATCH (title,body)
       AGAINST ('Security implications of running MySQL as root'
       IN NATURAL LANGUAGE MODE) AS score
       FROM articles
       WHERE MATCH (title,body)
       AGAINST ('Security implications of running MySQL as root'
       IN NATURAL LANGUAGE MODE);
+----+-------------------------------------+-----------------+
| id | body                                | score           |
+----+-------------------------------------+-----------------+
|  4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 |
|  6 | When configured properly, MySQL ... | 1.3114095926285 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)

A phrase that is enclosed within double quote (") characters matches only rows that contain the phrase literally, as it was typed.
The full-text engine splits the phrase into words and performs a search in the FULLTEXT index for the words. Nonword characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example, "test phrase" matches "test, phrase".

If the phrase contains no words that are in the index, the result is empty. For example, if all words are either stopwords or shorter than the minimum length of indexed words, the result is empty.

The MySQL FULLTEXT implementation regards any sequence of true word characters (letters, digits, and underscores) as a word. That sequence may also contain apostrophes ('), but not more than one in a row. This means that aaa'bbb is regarded as one word, but aaa''bbb is regarded as two words. Apostrophes at the beginning or the end of a word are stripped by the FULLTEXT parser; 'aaa'bbb' would be parsed as aaa'bbb.

The built-in FULLTEXT parser determines where words start and end by looking for certain delimiter characters; for example, " " (space), "," (comma), and "." (period). If words are not separated by delimiters (as in, for example, Chinese), the built-in FULLTEXT parser cannot determine where a word begins or ends. To be able to add words or other indexed terms in such languages to a FULLTEXT index that uses the built-in FULLTEXT parser, you must preprocess them so that they are separated by some arbitrary delimiter. Alternatively, you can create FULLTEXT indexes using the ngram parser plugin (for Chinese, Japanese, or Korean) or the MeCab parser plugin (for Japanese).

It is possible to write a plugin that replaces the built-in full-text parser. For details, see The MySQL Plugin API. For example parser plugin source code, see the plugin/fulltext directory of a MySQL source distribution.

Some words are ignored in full-text searches:
See Section 12.10.4, “Full-Text Stopwords” to view default stopword lists and how to change them. The default minimum word length can be changed as described in Section 12.10.6, “Fine-Tuning MySQL Full-Text Search”. Every correct word in the collection and in the query is weighted according to its significance in the collection or query. Thus, a word that is present in many documents has a lower weight, because it has lower semantic value in this particular collection. Conversely, if the word is rare, it receives a higher weight. The weights of the words are combined to compute the relevance of the row. This technique works best with large collections.
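The word-splitting rules described earlier in this section (runs of letters, digits, and underscores, optionally joined by single apostrophes) can be approximated with a regular expression. The following Python sketch is an illustration only, not the built-in parser: split_words() is a hypothetical helper, and it ignores stopwords and the minimum word length entirely.

```python
import re

# One or more word characters, optionally followed by groups of a single
# apostrophe plus more word characters. Two apostrophes in a row, or an
# apostrophe at either edge, cannot be part of a match.
WORD_RE = re.compile(r"\w+(?:'\w+)*")


def split_words(text: str) -> list:
    """Return the words found in text under the simplified rules."""
    return WORD_RE.findall(text)


print(split_words("aaa'bbb"))       # ["aaa'bbb"]
print(split_words("aaa''bbb"))      # ['aaa', 'bbb']
print(split_words("'aaa'bbb'"))     # ["aaa'bbb"]
print(split_words("test, phrase"))  # ['test', 'phrase']
```

The last call shows why the phrase "test phrase" can match the text "test, phrase": once the text is split into words, the comma is no longer present.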
MyISAM Limitation
For very small tables, word distribution does not adequately reflect their semantic value, and this model may sometimes produce bizarre results for search indexes on MyISAM tables. For example, although the word “MySQL” is present in every row of the articles table shown earlier, a search for the word in a MyISAM search index produces no results:

mysql> SELECT * FROM articles
       WHERE MATCH (title,body)
       AGAINST ('MySQL' IN NATURAL LANGUAGE MODE);
Empty set (0.00 sec)

The search result is empty because the word “MySQL” is present in at least 50% of the rows, and so is effectively treated as a stopword. This filtering technique is more suitable for large data sets, where you might not want the result set to return every second row from a 1GB table, than for small data sets where it might cause poor results for popular terms.

The 50% threshold can surprise you when you first try full-text searching to see how it works, and makes InnoDB tables more suited to experimentation with full-text searches. If you create a MyISAM table and insert only one or two rows of text into it, every word in the text occurs in at least 50% of the rows. As a result, no search returns any results until the table contains more rows. Users who need to bypass the 50% limitation can build search indexes on InnoDB tables, or use the boolean search mode explained in Section 12.10.2, “Boolean Full-Text Searches”.
12.10.2 Boolean Full-Text Searches

MySQL can perform boolean full-text searches using the IN BOOLEAN MODE modifier. With this modifier, certain characters have special meaning at the beginning or end of words in the search string. In the following query, the + and - operators indicate that a word must be present or absent, respectively, for a match to occur. Thus, the query retrieves all the rows that contain the word “MySQL” but that do not contain the word “YourSQL”:

mysql> SELECT * FROM articles WHERE MATCH (title,body)
       AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
+----+-----------------------+-------------------------------------+
| id | title                 | body                                |
+----+-----------------------+-------------------------------------+
|  1 | MySQL Tutorial        | DBMS stands for DataBase ...        |
|  2 | How To Use MySQL Well | After you went through a ...        |
|  3 | Optimizing MySQL      | In this tutorial, we show ...       |
|  4 | 1001 MySQL Tricks     | 1. Never run mysqld as root. 2. ... |
|  6 | MySQL Security        | When configured properly, MySQL ... |
+----+-----------------------+-------------------------------------+
Note In implementing this feature, MySQL uses what is sometimes referred to as implied Boolean logic, in which
Boolean full-text searches have these characteristics:
The boolean full-text search capability supports the following operators:
The following examples demonstrate some search strings that use boolean full-text operators:
Relevancy Rankings for InnoDB Boolean Mode Search

InnoDB full-text search is modeled on the Sphinx full-text search engine, and the algorithms used are based on BM25 and TF-IDF ranking algorithms. For these reasons, relevancy rankings for InnoDB boolean full-text search may differ from MyISAM relevancy rankings.

InnoDB uses a variation of the “term frequency-inverse document frequency” (TF-IDF) weighting system to rank a document's relevance for a given full-text search query. The TF-IDF weighting is based on how frequently a word appears in a document, offset by how frequently the word appears in all documents in the collection. In other words, the more frequently a word appears in a document, and the less frequently the word appears in the document collection, the higher the document is ranked.

How Relevancy Ranking is Calculated

The term frequency (TF) value is the number of times that a word appears in a document. The inverse document frequency (IDF) value of a word is calculated using the following formula, where total_records is the number of records in the collection, and matching_records is the number of records that the search term appears in.

${IDF} = log10( ${total_records} / ${matching_records} )

When a document contains a word multiple times, the IDF value is multiplied by the TF value:

${TF} * ${IDF}

Using the TF and IDF values, the relevancy ranking for a document is calculated using this formula:

${rank} = ${TF} * ${IDF} * ${IDF}

The formula is demonstrated in the following examples.

Relevancy Ranking for a Single Word Search

This example demonstrates the relevancy ranking calculation for a single-word search.
mysql> CREATE TABLE articles (
         id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
         title VARCHAR(200),
         body TEXT,
         FULLTEXT (title,body)
       ) ENGINE=InnoDB;
Query OK, 0 rows affected (1.04 sec)

mysql> INSERT INTO articles (title,body) VALUES
         ('MySQL Tutorial','This database tutorial ...'),
         ("How To Use MySQL",'After you went through a ...'),
         ('Optimizing Your Database','In this database tutorial ...'),
         ('MySQL vs. YourSQL','When comparing databases ...'),
         ('MySQL Security','When configured properly, MySQL ...'),
         ('Database, Database, Database','database database database'),
         ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
         ('MySQL Full-Text Indexes', 'MySQL fulltext indexes use a ..');
Query OK, 8 rows affected (0.06 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> SELECT id, title, body,
       MATCH (title,body) AGAINST ('database' IN BOOLEAN MODE) AS score
       FROM articles ORDER BY score DESC;
+----+------------------------------+-------------------------------------+---------------------+
| id | title                        | body                                | score               |
+----+------------------------------+-------------------------------------+---------------------+
|  6 | Database, Database, Database | database database database          |  1.0886961221694946 |
|  3 | Optimizing Your Database     | In this database tutorial ...       | 0.36289870738983154 |
|  1 | MySQL Tutorial               | This database tutorial ...          | 0.18144935369491577 |
|  2 | How To Use MySQL             | After you went through a ...        |                   0 |
|  4 | MySQL vs. YourSQL            | When comparing databases ...        |                   0 |
|  5 | MySQL Security               | When configured properly, MySQL ... |                   0 |
|  7 | 1001 MySQL Tricks            | 1. Never run mysqld as root. 2. ... |                   0 |
|  8 | MySQL Full-Text Indexes      | MySQL fulltext indexes use a ..     |                   0 |
+----+------------------------------+-------------------------------------+---------------------+
8 rows in set (0.00 sec)

There are 8 records in total, with 3 that match the “database” search term.
The first record (id 6) contains the search term 6 times and has a relevancy ranking of 1.0886961221694946. This ranking value is calculated using a TF value of 6 (the “database” search term appears 6 times in record id 6) and an IDF value of 0.42596873216370745, which is calculated as follows (where 8 is the total number of records and 3 is the number of records that the search term appears in):

${IDF} = log10( 8 / 3 ) = 0.42596873216370745

The TF and IDF values are then entered into the ranking formula:

${rank} = ${TF} * ${IDF} * ${IDF}

Performing the calculation in the MySQL command-line client returns a ranking value of 1.088696164686938.

mysql> SELECT 6*log10(8/3)*log10(8/3);
+-------------------------+
| 6*log10(8/3)*log10(8/3) |
+-------------------------+
|       1.088696164686938 |
+-------------------------+
1 row in set (0.00 sec)
Note
You may notice a slight difference in the ranking values returned by the SELECT ... MATCH ... AGAINST statement and the MySQL command-line client (1.0886961221694946 versus 1.088696164686938). The difference is due to how the casts between integers and floats/doubles are performed internally by InnoDB (along with related precision and rounding decisions), and how they are performed elsewhere, such as in the MySQL command-line client or other types of calculators.

Relevancy Ranking for a Multiple Word Search

This example demonstrates the relevancy ranking calculation for a multiple-word full-text search based on the articles table and data used in the previous example.

If you search on more than one word, the relevancy ranking value is a sum of the relevancy ranking value for each word, as shown in this formula:

${rank} = ${TF} * ${IDF} * ${IDF} + ${TF} * ${IDF} * ${IDF}

Performing a search on two terms ('mysql tutorial') returns the following results:

mysql> SELECT id, title, body, MATCH (title,body)
       AGAINST ('mysql tutorial' IN BOOLEAN MODE) AS score
       FROM articles ORDER BY score DESC;
+----+------------------------------+-------------------------------------+----------------------+
| id | title                        | body                                | score                |
+----+------------------------------+-------------------------------------+----------------------+
|  1 | MySQL Tutorial               | This database tutorial ...          |   0.7405621409416199 |
|  3 | Optimizing Your Database     | In this database tutorial ...       |   0.3624762296676636 |
|  5 | MySQL Security               | When configured properly, MySQL ... | 0.031219376251101494 |
|  8 | MySQL Full-Text Indexes      | MySQL fulltext indexes use a ..     | 0.031219376251101494 |
|  2 | How To Use MySQL             | After you went through a ...        | 0.015609688125550747 |
|  4 | MySQL vs. YourSQL            | When comparing databases ...        | 0.015609688125550747 |
|  7 | 1001 MySQL Tricks            | 1. Never run mysqld as root. 2. ... | 0.015609688125550747 |
|  6 | Database, Database, Database | database database database          |                    0 |
+----+------------------------------+-------------------------------------+----------------------+
8 rows in set (0.00 sec)

In the first record (id 1), 'mysql' appears once and 'tutorial' appears twice. There are six matching records for 'mysql' and two matching records for 'tutorial'. The MySQL command-line client returns the expected ranking value when inserting these values into the ranking formula for a multiple word search:

mysql> SELECT (1*log10(8/6)*log10(8/6)) + (2*log10(8/2)*log10(8/2));
+-------------------------------------------------------+
| (1*log10(8/6)*log10(8/6)) + (2*log10(8/2)*log10(8/2)) |
+-------------------------------------------------------+
|                                    0.7405621541938003 |
+-------------------------------------------------------+
1 row in set (0.00 sec)
Note
The slight difference in the ranking values returned by the SELECT ... MATCH ... AGAINST statement and the MySQL command-line client is explained in the preceding example.
12.10.3 Full-Text Searches with Query Expansion

Full-text search supports query expansion (and in particular, its variant “blind query expansion”). This is generally useful when a search phrase is too short, which often means that the user is relying on implied knowledge that the full-text search engine lacks. For example, a user searching for “database” may really mean that “MySQL”, “Oracle”, “DB2”, and “RDBMS” all are phrases that should match “databases” and should be returned, too. This is implied knowledge.

Blind query expansion (also known as automatic relevance feedback) is enabled by adding WITH QUERY EXPANSION or IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION following the search phrase. It works by performing the search twice, where the search phrase for the second search is the original search phrase concatenated with the few most highly relevant documents from the first search. Thus, if one of these documents contains the word “databases” and the word “MySQL”, the second search finds the documents that contain the word “MySQL” even if they do not contain the word “database”. The following example shows this difference:

mysql> SELECT * FROM articles
       WHERE MATCH (title,body)
       AGAINST ('database' IN NATURAL LANGUAGE MODE);
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)

mysql> SELECT * FROM articles
       WHERE MATCH (title,body)
       AGAINST ('database' WITH QUERY EXPANSION);
+----+-----------------------+------------------------------------------+
| id | title                 | body                                     |
+----+-----------------------+------------------------------------------+
|  5 | MySQL vs. YourSQL     | In the following database comparison ... |
|  1 | MySQL Tutorial        | DBMS stands for DataBase ...             |
|  3 | Optimizing MySQL      | In this tutorial we show ...             |
|  6 | MySQL Security        | When configured properly, MySQL ...      |
|  2 | How To Use MySQL Well | After you went through a ...             |
|  4 | 1001 MySQL Tricks     | 1. Never run mysqld as root. 2. ...      |
+----+-----------------------+------------------------------------------+
6 rows in set (0.00 sec)

Another example could be searching for books by Georges Simenon about Maigret, when a user is not sure how to spell “Maigret”. A search for “Megre and the reluctant witnesses” finds only “Maigret and the Reluctant Witnesses” without query expansion. A search with query expansion finds all books with the word “Maigret” on the second pass.
Note Because blind query expansion tends to increase noise significantly by returning nonrelevant documents, use it only when a search phrase is short.
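One way to judge how much noise the second pass introduces is to compare the relevance values that MATCH() returns with and without expansion. The following is only a sketch, assuming the articles table and its FULLTEXT index on (title,body) from the preceding example; the column aliases are illustrative.

```sql
-- Relevance of every row for the plain natural language search.
SELECT id, title,
       MATCH (title,body) AGAINST ('database' IN NATURAL LANGUAGE MODE) AS score
FROM articles
ORDER BY score DESC;

-- Relevance of every row after blind query expansion.
-- Rows that score above zero only here were pulled in by the second pass.
SELECT id, title,
       MATCH (title,body) AGAINST ('database' WITH QUERY EXPANSION) AS expanded_score
FROM articles
ORDER BY expanded_score DESC;
```

Comparing the two result sets side by side makes it easier to decide whether the extra matches are useful for a given search phrase.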
12.10.4 Full-Text Stopwords
The stopword list is loaded and searched for full-text queries using the server character set and collation (the values of the character_set_server and collation_server system variables). False hits or misses might occur for stopword lookups if the stopword file or columns used for full-text indexing or searches have a character set or collation different from character_set_server or collation_server.
Case sensitivity of stopword lookups depends on the server collation. For example, lookups are case-insensitive if the collation is utf8mb4_0900_ai_ci, whereas lookups are case-sensitive if the collation is utf8mb4_0900_as_cs or utf8mb4_bin.
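To check whether such a mismatch is possible, the server settings can be compared with the character set and collation of the indexed columns. This is a minimal sketch; the schema name test and table name articles are placeholders to replace with your own.

```sql
-- Character set and collation used to load and search the stopword list.
SELECT @@character_set_server, @@collation_server;

-- Character set and collation of the string columns in a full-text indexed table.
-- 'test' and 'articles' are placeholder names.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'test'
  AND TABLE_NAME = 'articles'
  AND CHARACTER_SET_NAME IS NOT NULL;
```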
Stopwords for InnoDB Search Indexes
InnoDB has a relatively short list of default stopwords, because documents from technical, literary, and other sources often use short words as keywords or in significant phrases. For example, you might search for “to be or not to be” and expect to get a sensible result, rather than having all those words ignored.
To see the default InnoDB stopword list, query the INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD table.

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
+-------+
| value |
+-------+
| a     |
| about |
| an    |
| are   |
| as    |
| at    |
| be    |
| by    |
| com   |
| de    |
| en    |
| for   |
| from  |
| how   |
| i     |
| in    |
| is    |
| it    |
| la    |
| of    |
| on    |
| or    |
| that  |
| the   |
| this  |
| to    |
| was   |
| what  |
| when  |
| where |
| who   |
| will  |
| with  |
| und   |
| the   |
| www   |
+-------+
36 rows in set (0.00 sec)

To define your own stopword list for all InnoDB tables, define a table with the same structure as the INNODB_FT_DEFAULT_STOPWORD table, populate it with stopwords, and set the value of the innodb_ft_server_stopword_table option to a value in the form db_name/table_name before creating the full-text index. The stopword table must have a single VARCHAR column named value. The following example demonstrates creating and configuring a new global stopword table for InnoDB.

-- Create a new stopword table
mysql> CREATE TABLE my_stopwords(value VARCHAR(30)) ENGINE = INNODB;
Query OK, 0 rows affected (0.01 sec)

-- Insert stopwords (for simplicity, a single stopword is used in this example)
mysql> INSERT INTO my_stopwords(value) VALUES ('Ishmael');
Query OK, 1 row affected (0.00 sec)

-- Create the table
mysql> CREATE TABLE opening_lines (
    id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    opening_line TEXT(500),
    author VARCHAR(200),
    title VARCHAR(200)
    ) ENGINE=InnoDB;
Query OK, 0 rows affected (0.01 sec)

-- Insert data into the table
mysql> INSERT INTO opening_lines(opening_line,author,title) VALUES
    ('Call me Ishmael.','Herman Melville','Moby-Dick'),
    ('A screaming comes across the sky.','Thomas Pynchon','Gravity\'s Rainbow'),
    ('I am an invisible man.','Ralph Ellison','Invisible Man'),
    ('Where now? Who now? When now?','Samuel Beckett','The Unnamable'),
    ('It was love at first sight.','Joseph Heller','Catch-22'),
    ('All this happened, more or less.','Kurt Vonnegut','Slaughterhouse-Five'),
    ('Mrs. Dalloway said she would buy the flowers herself.','Virginia Woolf','Mrs. Dalloway'),
    ('It was a pleasure to burn.','Ray Bradbury','Fahrenheit 451');
Query OK, 8 rows affected (0.00 sec)
Records: 8  Duplicates: 0  Warnings: 0

-- Set the innodb_ft_server_stopword_table option to the new stopword table
mysql> SET GLOBAL innodb_ft_server_stopword_table = 'test/my_stopwords';
Query OK, 0 rows affected (0.00 sec)

-- Create the full-text index (which rebuilds the table if no FTS_DOC_ID column is defined)
mysql> CREATE FULLTEXT INDEX idx ON opening_lines(opening_line);
Query OK, 0 rows affected, 1 warning (1.17 sec)
Records: 0  Duplicates: 0  Warnings: 1

Verify that the specified stopword ('Ishmael') does not appear by querying the words in INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE.
Note By default, words less than 3 characters in length or greater than 84 characters in length do not appear in an InnoDB full-text search index. Maximum and minimum word length values are configurable using the innodb_ft_max_token_size and innodb_ft_min_token_size variables. This default behavior does not apply to the ngram parser plugin. ngram token size is defined by the ngram_token_size option.

mysql> SET GLOBAL innodb_ft_aux_table='test/opening_lines';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT word FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE LIMIT 15;
+-----------+
| word      |
+-----------+
| across    |
| all       |
| burn      |
| buy       |
| call      |
| comes     |
| dalloway  |
| first     |
| flowers   |
| happened  |
| herself   |
| invisible |
| less      |
| love      |
| man       |
+-----------+
15 rows in set (0.00 sec)

To create stopword lists on a table-by-table basis, create other stopword tables and use the innodb_ft_user_stopword_table option to specify the stopword table that you want to use before you create the full-text index.
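A table-by-table stopword list follows the same pattern as the global one. The sketch below assumes the opening_lines table and the idx index from the preceding example; the table name my_table_stopwords and the stopword 'pleasure' are hypothetical and chosen only for illustration.

```sql
-- Hypothetical per-table stopword list: a single VARCHAR column named value.
CREATE TABLE my_table_stopwords(value VARCHAR(30)) ENGINE = INNODB;
INSERT INTO my_table_stopwords(value) VALUES ('pleasure');

-- Select the list in db_name/table_name form before creating the index.
SET SESSION innodb_ft_user_stopword_table = 'test/my_table_stopwords';

-- Drop and re-create the full-text index so that the new list is applied.
ALTER TABLE opening_lines DROP INDEX idx;
CREATE FULLTEXT INDEX idx ON opening_lines(opening_line);
```

After the index is re-created, the INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE query shown above can be used to confirm which words were indexed.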
Stopwords for MyISAM Search Indexes
The stopword file is loaded and searched using latin1 if character_set_server is ucs2, utf16, utf16le, or utf32.
To override the default stopword list for MyISAM tables, set the ft_stopword_file system variable. (See Section 5.1.8, “Server System Variables”.) The variable value should be the path name of the file containing the stopword list, or the empty string to disable stopword filtering. The server looks for the file in the data directory unless an absolute path name is given to specify a different directory. After changing the value of this variable or the contents of the stopword file, restart the server and rebuild your FULLTEXT indexes.
The stopword list is free-form, separating stopwords with any nonalphanumeric character such as newline, space, or comma. Exceptions are the underscore character (_) and a single apostrophe (') which are treated as part of a word. The character set of the stopword list is the server's default character set; see Section 10.3.2, “Server Character Set and Collation”.
The following list shows the default stopwords for MyISAM search indexes. In a MySQL source distribution, you can find this list in the storage/myisam/ft_static.c file.
a's able about above according accordingly across actually after afterwards again against ain't all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren't around as aside ask asking associated at available away awfully be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by c'mon c's came can can't cannot cant cause causes certain certainly changes clearly co com come comes concerning consequently consider considering contain containing contains corresponding could couldn't course currently definitely described despite did didn't different do does doesn't doing don't done down downwards during each edu eg eight either else elsewhere enough entirely especially et etc even ever every everybody everyone everything everywhere ex exactly example except far few fifth first five followed following follows for former formerly forth four from further furthermore get gets getting given gives go goes going gone got gotten greetings had hadn't happens hardly has hasn't have haven't having he he's hello help hence her here here's hereafter hereby herein hereupon hers herself hi him himself his hither hopefully how howbeit however i'd i'll i'm i've ie if ignored immediate in inasmuch inc indeed indicate indicated indicates inner insofar instead into inward is isn't it it'd it'll it's its itself just keep keeps kept know known knows last lately later latter latterly least less lest let let's like liked likely little look looking looks ltd mainly many may maybe me mean meanwhile merely might more moreover most mostly much must my myself name namely nd near nearly necessary need needs neither never nevertheless new next nine no nobody non none noone nor normally not nothing novel now nowhere obviously of off often oh ok okay old on once one ones 
only onto or other others otherwise ought our ours ourselves out outside over overall own particular particularly per perhaps placed please plus possible presumably probably provides que quite qv rather rd re really reasonably regarding regardless regards relatively respectively right said same saw say saying says second secondly see seeing seem seemed seeming seems seen self selves sensible sent serious seriously seven several shall she should shouldn't since six so some somebody somehow someone something sometime sometimes somewhat somewhere soon sorry specified specify specifying still sub such sup sure t's take taken tell tends th than thank thanks thanx that that's thats the their theirs them themselves then thence there there's thereafter thereby therefore therein theres thereupon these they they'd they'll they're they've think third this thorough thoroughly those though three through throughout thru thus to together too took toward towards tried tries truly try trying twice two un under unfortunately unless unlikely until unto up upon us use used useful uses using usually value various very via viz vs want wants was wasn't way we we'd we'll we're we've welcome well went were weren't what what's whatever when whence whenever where where's whereafter whereas whereby wherein whereupon wherever whether which while whither who who's whoever whole whom whose why will willing wish with within without won't wonder would wouldn't yes yet you you'd you'll you're you've your yours yourself yourselves zero
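To find out which stopword source a running server uses for MyISAM indexes, the ft_stopword_file variable can be inspected. This is a small sketch; tbl_name is a placeholder, and the rebuild step applies only after the variable or the file has been changed and the server restarted.

```sql
-- Show the stopword file currently in effect for MyISAM full-text indexes.
SHOW VARIABLES LIKE 'ft_stopword_file';

-- After changing the file or the variable and restarting the server,
-- rebuild the FULLTEXT indexes of each affected MyISAM table.
REPAIR TABLE tbl_name QUICK;
```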
12.10.5 Full-Text Restrictions
12.10.6 Fine-Tuning MySQL Full-Text Search
MySQL's full-text search capability has few user-tunable parameters. You can exert more control over full-text searching behavior if you have a MySQL source distribution because some changes require source code modifications. See Section 2.9, “Installing MySQL from Source”.
Full-text search is carefully tuned for effectiveness. Modifying the default behavior in most cases can actually decrease effectiveness. Do not alter the MySQL sources unless you know what you are doing.
Most full-text variables described in this section must be set at server startup time. A server restart is required to change them; they cannot be modified while the server is running.
Some variable changes require that you rebuild the FULLTEXT indexes in your tables. Instructions for doing so are given later in this section.
Configuring Minimum and Maximum Word Length
The minimum and maximum lengths of words to be indexed are defined by the innodb_ft_min_token_size and innodb_ft_max_token_size system variables for InnoDB search indexes, and by ft_min_word_len and ft_max_word_len for MyISAM search indexes.
Note Minimum and maximum word length full-text parameters do not apply to FULLTEXT indexes created using the ngram parser. ngram token size is defined by the ngram_token_size option.
After changing any of these options, rebuild your FULLTEXT indexes for the change to take effect. For example, to make two-character words searchable, you could put the following lines in an option file:

[mysqld]
innodb_ft_min_token_size=2
ft_min_word_len=2

Then restart the server and rebuild your FULLTEXT indexes. For MyISAM tables, note the remarks regarding myisamchk in the instructions that follow for rebuilding MyISAM full-text indexes.
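After the restart, it can be worth confirming that the server picked up the new values before rebuilding any indexes. A minimal check, using only the system variables named above:

```sql
-- Word length limits in effect for InnoDB and MyISAM full-text indexes.
-- These values are fixed at startup and cannot be changed with SET.
SELECT @@innodb_ft_min_token_size, @@innodb_ft_max_token_size,
       @@ft_min_word_len, @@ft_max_word_len;
```

If the values shown do not match the option file, the server may be reading a different option file than the one that was edited.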
Configuring the Natural Language Search Threshold
For MyISAM search indexes, the 50% threshold for natural language searches is determined by the particular weighting scheme chosen. To disable it, look for the following line in storage/myisam/ftdefs.h:

#define GWS_IN_USE GWS_PROB

Change that line to this:

#define GWS_IN_USE GWS_FREQ

Then recompile MySQL. There is no need to rebuild the indexes in this case.
Note By making this change, you severely decrease MySQL's ability to provide adequate relevance values for the MATCH() function. If you really need to search for such common words, it would be better to search using IN BOOLEAN MODE instead, which does not observe the 50% threshold.
Modifying Boolean Full-Text Search Operators
To change the operators used for boolean full-text searches on MyISAM tables, set the ft_boolean_syntax system variable. (InnoDB does not have an equivalent setting.) This variable can be changed while the server is running, but you must have privileges sufficient to set global system variables (see Section 5.1.9.1, “System Variable Privileges”). No rebuilding of indexes is necessary in this case.
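Before changing the operator string, it helps to record the current value so that it can be restored. The following sketch only inspects and resets the variable; it deliberately does not suggest a replacement operator string, because the permitted values are described with the ft_boolean_syntax variable itself.

```sql
-- Inspect the operator string currently used for MyISAM boolean searches.
SELECT @@GLOBAL.ft_boolean_syntax;

-- Return to the compiled-in default after experimenting.
SET GLOBAL ft_boolean_syntax = DEFAULT;
```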
Character Set Modifications
For the built-in full-text parser, you can change the set of characters that are considered word characters in several ways, as described in the following list. After making the modification, rebuild the indexes for each table that contains any FULLTEXT indexes. Suppose that you want to treat the hyphen character ('-') as a word character. Use one of these methods:
Rebuilding InnoDB Full-Text Indexes
For the changes to take effect, FULLTEXT indexes must be rebuilt after modifying any of the following full-text index variables: innodb_ft_min_token_size; innodb_ft_max_token_size; innodb_ft_server_stopword_table; innodb_ft_user_stopword_table; innodb_ft_enable_stopword; ngram_token_size. Modifying innodb_ft_min_token_size, innodb_ft_max_token_size, or ngram_token_size requires restarting the server.
To rebuild FULLTEXT indexes for an InnoDB table, use ALTER TABLE with the DROP INDEX and ADD INDEX options to drop and re-create each index.
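As a sketch of the drop and re-create sequence, assuming the opening_lines table and its idx index from Section 12.10.4 (substitute your own table, index, and column names):

```sql
-- Drop the existing full-text index.
ALTER TABLE opening_lines DROP INDEX idx;

-- Re-create it so that the new variable values are applied.
ALTER TABLE opening_lines ADD FULLTEXT INDEX idx (opening_line);
```

Repeat the pair of statements for each FULLTEXT index on the table.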
Optimizing InnoDB Full-Text Indexes
Running OPTIMIZE TABLE on a table with a full-text index rebuilds the full-text index, removing deleted Document IDs and consolidating multiple entries for the same word, where possible.
To optimize a full-text index, enable innodb_optimize_fulltext_only and run OPTIMIZE TABLE.

mysql> set GLOBAL innodb_optimize_fulltext_only=ON;
Query OK, 0 rows affected (0.01 sec)

mysql> OPTIMIZE TABLE opening_lines;
+--------------------+----------+----------+----------+
| Table              | Op       | Msg_type | Msg_text |
+--------------------+----------+----------+----------+
| test.opening_lines | optimize | status   | OK       |
+--------------------+----------+----------+----------+
1 row in set (0.01 sec)

To avoid lengthy rebuild times for full-text indexes on large tables, you can use the innodb_ft_num_word_optimize option to perform the optimization in stages. The innodb_ft_num_word_optimize option defines the number of words that are optimized each time OPTIMIZE TABLE is run. The default setting is 2000, which means that 2000 words are optimized each time OPTIMIZE TABLE is run. Subsequent OPTIMIZE TABLE operations continue from where the preceding OPTIMIZE TABLE operation ended.
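A staged optimization might look like the following sketch. It assumes the opening_lines table from the earlier example; the batch size of 5000 is an arbitrary illustrative value, and the number of OPTIMIZE TABLE runs needed depends on how many words the index contains.

```sql
-- Restrict OPTIMIZE TABLE to the full-text index.
SET GLOBAL innodb_optimize_fulltext_only = ON;

-- Process 5000 words per run instead of the default 2000 (illustrative value).
SET GLOBAL innodb_ft_num_word_optimize = 5000;

-- Each run continues from where the preceding run ended.
OPTIMIZE TABLE opening_lines;
OPTIMIZE TABLE opening_lines;

-- Restore normal OPTIMIZE TABLE behavior when finished.
SET GLOBAL innodb_optimize_fulltext_only = OFF;
```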
Rebuilding MyISAM Full-Text Indexes
If you modify full-text variables that affect indexing (ft_min_word_len, ft_max_word_len, or ft_stopword_file), or if you change the stopword file itself, you must rebuild your FULLTEXT indexes after making the changes and restarting the server.
To rebuild the FULLTEXT indexes for a MyISAM table, it is sufficient to do a QUICK repair operation:

mysql> REPAIR TABLE tbl_name QUICK;

Alternatively, use ALTER TABLE as just described. In some cases, this may be faster than a repair operation.
Each table that contains any FULLTEXT index must be repaired as just shown. Otherwise, queries for the table may yield incorrect results, and modifications to the table cause the server to see the table as corrupt and in need of repair.
If you use myisamchk to perform an operation that modifies MyISAM table indexes (such as repair or analyze), the FULLTEXT indexes are rebuilt using the default full-text parameter values for minimum word length, maximum word length, and stopword file unless you specify otherwise. This can result in queries failing.
The problem occurs because these parameters are known only by the server. They are not stored in MyISAM index files. To avoid the problem if you have modified the minimum or maximum word length or stopword file values used by the server, specify the same ft_min_word_len, ft_max_word_len, and ft_stopword_file values for myisamchk that you use for mysqld. For example, if you have set the minimum word length to 3, you can repair a table with myisamchk like this:

myisamchk --recover --ft_min_word_len=3 tbl_name.MYI

To ensure that myisamchk and the server use the same values for full-text parameters, place each one in both the [mysqld] and [myisamchk] sections of an option file:

[mysqld]
ft_min_word_len=3

[myisamchk]
ft_min_word_len=3

An alternative to using myisamchk for MyISAM table index modification is to use the REPAIR TABLE, ANALYZE TABLE, OPTIMIZE TABLE, or ALTER TABLE statements. These statements are performed by the server, which knows the proper full-text parameter values to use.
12.10.7 Adding a User-Defined Collation for Full-Text Indexing
This section describes how to add a user-defined collation for full-text searches using the built-in full-text parser. The sample collation is like latin1_swedish_ci but treats the '-' character as a letter rather than as a punctuation character so that it can be indexed as a word character. General information about adding collations is given in Section 10.14, “Adding a Collation to a Character Set”; it is assumed that you have read it and are familiar with the files involved.
To add a collation for full-text indexing, use the following procedure. The instructions here add a collation for a simple character set, which as discussed in Section 10.14, “Adding a Collation to a Character Set”, can be created using a configuration file that describes the character set properties. For a complex character set such as Unicode, create collations using C source files that describe the character set properties.
12.10.8 ngram Full-Text Parser
The built-in MySQL full-text parser uses the white space between words as a delimiter to determine where words begin and end, which is a limitation when working with ideographic languages that do not use word delimiters. To address this limitation, MySQL provides an ngram full-text parser that supports Chinese, Japanese, and Korean (CJK). The ngram full-text parser is supported for use with InnoDB and MyISAM.
An ngram is a contiguous sequence of n characters from a given sequence of text. The ngram parser tokenizes a sequence of text into a contiguous sequence of n characters. For example, you can tokenize “abcd” for different values of n using the ngram full-text parser.

n=1: 'a', 'b', 'c', 'd'
n=2: 'ab', 'bc', 'cd'
n=3: 'abc', 'bcd'
n=4: 'abcd'

The ngram full-text parser is a built-in server plugin. As with other built-in server plugins, it is automatically loaded when the server is started.
The full-text search syntax described in Section 12.10, “Full-Text Search Functions” applies to the ngram parser plugin. Differences in parsing behavior are described in this section. Full-text-related configuration options, except for minimum and maximum word length options (innodb_ft_min_token_size, innodb_ft_max_token_size, ft_min_word_len, ft_max_word_len) are also applicable.
Configuring ngram Token Size
The ngram parser has a default ngram token size of 2 (bigram). For example, with a token size of 2, the ngram parser parses the string “abc def” into four tokens: “ab”, “bc”, “de” and “ef”.
ngram token size is configurable using the ngram_token_size configuration option, which has a minimum value of 1 and maximum value of 10.
Typically, ngram_token_size is set to the size of the largest token that you want to search for. If you only intend to search for single characters, set ngram_token_size to 1. A smaller token size produces a smaller full-text search index, and faster searches. If you need to search for words composed of more than one character, set ngram_token_size accordingly. For example, “Happy Birthday” is “生日快乐” in simplified Chinese, where “生日” is “birthday”, and “快乐” translates as “happy”. To search on two-character words such as these, set ngram_token_size to a value of 2 or higher.
As a read-only variable, ngram_token_size can be set only as part of a startup string or in a configuration file.
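Because the variable cannot be changed at runtime, the practical check is to read its value after startup. The sketch below does that; the option-file lines quoted in the comments are an assumed form, modeled on the [mysqld] option file example used elsewhere in this chapter.

```sql
-- Assumed option-file form (by analogy with other server options):
--   [mysqld]
--   ngram_token_size=2

-- Confirm the token size that the running server was started with.
SELECT @@ngram_token_size;

-- The variable is read-only, so a statement such as the following
-- is expected to be rejected rather than to change the token size:
--   SET GLOBAL ngram_token_size = 3;
```

Remember that FULLTEXT indexes that use the ngram parser must be rebuilt after the token size changes.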
To create a FULLTEXT index that uses the ngram parser, specify WITH PARSER ngram with CREATE TABLE, ALTER TABLE, or CREATE INDEX.
The following example demonstrates creating a table with an ngram FULLTEXT index, inserting sample data (Simplified Chinese text), and viewing tokenized data in the INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE table.

mysql> USE test;
mysql> CREATE TABLE articles (
      id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
      title VARCHAR(200),
      body TEXT,
      FULLTEXT (title,body) WITH PARSER ngram
    ) ENGINE=InnoDB CHARACTER SET utf8mb4;
mysql> SET NAMES utf8mb4;
mysql> INSERT INTO articles (title,body) VALUES
    ('数据库管理','在本教程中我将向你展示如何管理数据库'),
    ('数据库应用开发','学习开发数据库应用程序');
mysql> SET GLOBAL innodb_ft_aux_table="test/articles";
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position;

To add a FULLTEXT index to an existing table, you can use ALTER TABLE or CREATE INDEX. For example:

CREATE TABLE articles (
      id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
      title VARCHAR(200),
      body TEXT
     ) ENGINE=InnoDB CHARACTER SET utf8;

ALTER TABLE articles ADD FULLTEXT INDEX ft_index (title,body) WITH PARSER ngram;

# Or:

CREATE FULLTEXT INDEX ft_index ON articles (title,body) WITH PARSER ngram;

ngram Parser Space Handling
The ngram parser eliminates spaces when parsing. For example:
The built-in MySQL full-text parser compares words to entries in the stopword list. If a word is equal to an entry in the stopword list, the word is excluded from the index. For the ngram parser, stopword handling is performed differently. Instead of excluding tokens that are equal to entries in the stopword list, the ngram parser excludes tokens that contain stopwords. For example, assuming ngram_token_size=2, a document that contains “a,b” is parsed to “a,” and “,b”. If a comma (“,”) is defined as a stopword, both “a,” and “,b” are excluded from the index because they contain a comma.
By default, the ngram parser uses the default stopword list, which contains a list of English stopwords. For a stopword list applicable to Chinese, Japanese, or Korean, you must create your own. For information about creating a stopword list, see Section 12.10.4, “Full-Text Stopwords”.
Stopwords greater in length than ngram_token_size are ignored.
ngram Parser Term Search
For natural language mode search, the search term is converted to a union of ngram terms. For example, the string “abc” (assuming ngram_token_size=2) is converted to “ab bc”. Given two documents, one containing “ab” and the other containing “abc”, the search term “ab bc” matches both documents.
For boolean mode search, the search term is converted to an ngram phrase search. For example, the string 'abc' (assuming ngram_token_size=2) is converted to '“ab bc”'. Given two documents, one containing 'ab' and the other containing 'abc', the search phrase '“ab bc”' only matches the document containing 'abc'.
ngram Parser Wildcard Search
Because an ngram FULLTEXT index contains only ngrams, and does not contain information about the beginning of terms, wildcard searches may return unexpected results. The following behaviors apply to wildcard searches using ngram FULLTEXT search indexes:
Phrase searches are converted to ngram phrase searches. For example, the search phrase “abc” is converted to “ab bc”, which returns documents containing “abc” and “ab bc”. The search phrase “abc def” is converted to “ab bc de ef”, which returns documents containing “abc def” and “ab bc de ef”. A document that contains “abcdef” is not returned.
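The three search styles described in this section can be tried against the sample data. The queries below are a sketch, assuming the articles table with the ngram FULLTEXT index on (title,body) created earlier in this section and the default ngram_token_size of 2; no particular result set is implied.

```sql
-- Natural language mode: the term is searched as a union of its ngrams.
SELECT * FROM articles
WHERE MATCH (title,body) AGAINST ('数据库' IN NATURAL LANGUAGE MODE);

-- Boolean mode: the term is converted to an ngram phrase search,
-- so the ngrams must appear together and in order.
SELECT * FROM articles
WHERE MATCH (title,body) AGAINST ('数据库' IN BOOLEAN MODE);

-- Explicit phrase search: the quoted phrase is converted to an ngram phrase search.
SELECT * FROM articles
WHERE MATCH (title,body) AGAINST ('"数据库管理"' IN BOOLEAN MODE);
```

Running the natural language and boolean variants with the same term is a quick way to see how much broader the union-of-ngrams match is.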
12.10.9 MeCab Full-Text Parser Plugin
The built-in MySQL full-text parser uses the white space between words as a delimiter to determine where words begin and end, which is a limitation when working with ideographic languages that do not use word delimiters. To address this limitation for Japanese, MySQL provides a MeCab full-text parser plugin. The MeCab full-text parser plugin is supported for use with InnoDB and MyISAM.
The MeCab full-text parser plugin is a full-text parser plugin for Japanese that tokenizes a sequence of text into meaningful words. For example, MeCab tokenizes “データベース管理” (“Database Management”) into “データベース” (“Database”) and “管理” (“Management”). By comparison, the ngram full-text parser tokenizes text into a contiguous sequence of n characters, where n represents a number between 1 and 10.
Besides tokenizing text into meaningful words, the MeCab parser typically produces smaller indexes than the ngram parser, and MeCab full-text searches are generally faster. One drawback is that it may take longer for the MeCab full-text parser to tokenize documents, compared to the ngram full-text parser.
The full-text search syntax described in Section 12.10, “Full-Text Search Functions” applies to the MeCab parser plugin. Differences in parsing behavior are described in this section. Full-text related configuration options are also applicable.
For additional information about the MeCab parser, refer to the MeCab: Yet Another Part-of-Speech and Morphological Analyzer project on GitHub.
Installing the MeCab Parser Plugin
The MeCab parser plugin requires mecab and mecab-ipadic.
On supported Fedora, Debian and Ubuntu platforms (except Ubuntu 12.04 where the system mecab version is too old), MySQL dynamically links to the system mecab installation if it is installed to the default location. On other supported Unix-like platforms, libmecab.so is statically linked in libpluginmecab.so, which is located in the MySQL plugin directory. mecab-ipadic is included in MySQL binaries and is located in MYSQL_HOME/lib/mecab.
You can install mecab and mecab-ipadic using a native package management utility (on Fedora, Debian, and Ubuntu), or you can build mecab and mecab-ipadic from source. For information about installing mecab and mecab-ipadic using a native package management utility, see Installing MeCab From a Binary Distribution (Optional). If you want to build mecab and mecab-ipadic from source, see Building MeCab From Source (Optional).
On Windows, libmecab.dll is found in the MySQL bin directory. mecab-ipadic is located in MYSQL_HOME/lib/mecab.
To install and configure the MeCab parser plugin, perform the following steps:
To create a FULLTEXT index that uses the mecab parser, specify WITH PARSER mecab with CREATE TABLE, ALTER TABLE, or CREATE INDEX.
This example demonstrates creating a table with a mecab FULLTEXT index, inserting sample data, and viewing tokenized data in the INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE table:

mysql> USE test;
mysql> CREATE TABLE articles (
      id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
      title VARCHAR(200),
      body TEXT,
      FULLTEXT (title,body) WITH PARSER mecab
    ) ENGINE=InnoDB CHARACTER SET utf8;
mysql> SET NAMES utf8;
mysql> INSERT INTO articles (title,body) VALUES
    ('データベース管理','このチュートリアルでは、私はどのようにデータベースを管理する方法を紹介します'),
    ('データベースアプリケーション開発','データベースアプリケーションを開発することを学ぶ');
mysql> SET GLOBAL innodb_ft_aux_table="test/articles";
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position;

To add a FULLTEXT index to an existing table, you can use ALTER TABLE or CREATE INDEX. For example:

CREATE TABLE articles (
      id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
      title VARCHAR(200),
      body TEXT
     ) ENGINE=InnoDB CHARACTER SET utf8;

ALTER TABLE articles ADD FULLTEXT INDEX ft_index (title,body) WITH PARSER mecab;

# Or:

CREATE FULLTEXT INDEX ft_index ON articles (title,body) WITH PARSER mecab;

MeCab Parser Space Handling
The MeCab parser uses spaces as separators in query strings. For example, the MeCab parser tokenizes データベース管理 as データベース and 管理.
MeCab Parser Stopword Handling
By default, the MeCab parser uses the default stopword list, which contains a short list of English stopwords. For a stopword list applicable to Japanese, you must create your own. For information about creating stopword lists, see Section 12.10.4, “Full-Text Stopwords”.
MeCab Parser Term Search
For natural language mode search, the search term is converted to a union of tokens. For example, データベース管理 is converted to データベース 管理.

SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('データベース管理' IN NATURAL LANGUAGE MODE);

For boolean mode search, the search term is converted to a search phrase. For example, データベース管理 is converted to データベース 管理.

SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('データベース管理' IN BOOLEAN MODE);

MeCab Parser Wildcard Search
Wildcard search terms are not tokenized. A search on データベース管理* is performed on the prefix, データベース管理.

SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('データベース*' IN BOOLEAN MODE);

MeCab Parser Phrase Search
Phrases are tokenized. For example, データベース管理 is tokenized as データベース 管理.

SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('"データベース管理"' IN BOOLEAN MODE);

Installing MeCab From a Binary Distribution (Optional)
This section describes how to install mecab and mecab-ipadic from a binary distribution using a native package management utility. For example, on Fedora, you can use Yum to perform the installation:

yum install mecab-devel

On Debian or Ubuntu, you can perform an APT installation:

apt-get install mecab
apt-get install mecab-ipadic

Installing MeCab From Source (Optional)
If you want to build mecab and mecab-ipadic from source, basic installation steps are provided below. For additional information, refer to the MeCab documentation.
12.11 Cast Functions and Operators
Table 12.15 Cast Functions and Operators
Cast functions and operators enable conversion of values from one data type to another.
Cast Function and Operator Descriptions
Character Set ConversionsCONVERT() with a USING clause converts data between character sets: CONVERT(expr USING transcoding_name)In MySQL, transcoding names are the same as the corresponding character set names. Examples: SELECT CONVERT('test' USING utf8mb4); SELECT CONVERT(_latin1'Müller' USING utf8mb4); INSERT INTO utf8mb4_table (utf8mb4_column) SELECT CONVERT(latin1_column USING utf8mb4) FROM latin1_table;To convert strings between character sets, you can also use CONVERT(expr, type) syntax (without USING), or CAST(expr AS type), which is equivalent: CONVERT(string, CHAR[(N)] CHARACTER SET charset_name) CAST(string AS CHAR[(N)] CHARACTER SET charset_name)Examples: SELECT CONVERT('test', CHAR CHARACTER SET utf8mb4); SELECT CAST('test' AS CHAR CHARACTER SET utf8mb4);If you specify CHARACTER SET charset_name as just shown, the character set and collation of the result are charset_name and the default collation of charset_name. If you omit CHARACTER SET charset_name, the character set and collation of the result are defined by the character_set_connection and collation_connection system variables that determine the default connection character set and collation (see Section 10.4, “Connection Character Sets and Collations”). A COLLATE clause is not permitted within a CONVERT() or CAST() call, but you can apply it to the function result. For example, these are legal: SELECT CONVERT('test' USING utf8mb4) COLLATE utf8mb4_bin; SELECT CONVERT('test', CHAR CHARACTER SET utf8mb4) COLLATE utf8mb4_bin; SELECT CAST('test' AS CHAR CHARACTER SET utf8mb4) COLLATE utf8mb4_bin;But these are illegal: SELECT CONVERT('test' USING utf8mb4 COLLATE utf8mb4_bin); SELECT CONVERT('test', CHAR CHARACTER SET utf8mb4 COLLATE utf8mb4_bin); SELECT CAST('test' AS CHAR CHARACTER SET utf8mb4 COLLATE utf8mb4_bin);For string literals, another way to specify the character set is to use a character set introducer. _latin1 and _latin2 in the preceding example are instances of introducers. 
Unlike conversion functions such as CAST() or CONVERT(), which convert a string from one character set to another, an introducer designates a string literal as having a particular character set, with no conversion involved. For more information, see Section 10.3.8, “Character Set Introducers”.
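The difference between an introducer and a conversion can be seen by inspecting the stored bytes. The following is a minimal sketch; the byte value X'FC' (the latin1 encoding of ü) and the column aliases are chosen here for illustration only:

```sql
-- An introducer only labels the bytes; CONVERT() re-encodes them.
SELECT
  HEX(_latin1 X'FC')                        AS introducer_only,  -- FC
  HEX(CONVERT(_latin1 X'FC' USING utf8mb4)) AS converted;        -- C3BC
```

The introducer leaves the single byte unchanged, whereas CONVERT() produces the two-byte utf8mb4 encoding of the same character.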
Character Set Conversions for String ComparisonsNormally, you cannot compare a BLOB value or other binary string in case-insensitive fashion because binary strings use the binary character set, which has no collation with the concept of lettercase. To perform a case-insensitive comparison, first use the CONVERT() or CAST() function to convert the value to a nonbinary string. Comparisons of the resulting string use its collation. For example, if the conversion result collation is not case-sensitive, a LIKE operation is not case-sensitive. That is true for the following operation because the default utf8mb4 collation (utf8mb4_0900_ai_ci) is not case-sensitive: SELECT 'A' LIKE CONVERT(blob_col USING utf8mb4) FROM tbl_name;To specify a particular collation for the converted string, use a COLLATE clause following the CONVERT() call: SELECT 'A' LIKE CONVERT(blob_col USING utf8mb4) COLLATE utf8mb4_unicode_ci FROM tbl_name;To use a different character set, substitute its name for utf8mb4 in the preceding statements (and similarly to use a different collation). CONVERT() and CAST() can be used more generally for comparing strings represented in different character sets. For example, a comparison of these strings results in an error because they have different character sets: mysql> SET @s1 = _latin1 'abc', @s2 = _latin2 'abc'; mysql> SELECT @s1 = @s2; ERROR 1267 (HY000): Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (latin2_general_ci,IMPLICIT) for operation '='Converting one of the strings to a character set compatible with the other enables the comparison to occur without error: mysql> SELECT @s1 = CONVERT(@s2 USING latin1); +---------------------------------+ | @s1 = CONVERT(@s2 USING latin1) | +---------------------------------+ | 1 | +---------------------------------+Character set conversion is also useful preceding lettercase conversion of binary strings. 
LOWER() and UPPER() are ineffective when applied directly to binary strings because the concept of lettercase does not apply. To perform lettercase conversion of a binary string, first convert it to a nonbinary string using a character set appropriate for the data stored in the string: mysql> SET @str = BINARY 'New York'; mysql> SELECT LOWER(@str), LOWER(CONVERT(@str USING utf8mb4)); +-------------+------------------------------------+ | LOWER(@str) | LOWER(CONVERT(@str USING utf8mb4)) | +-------------+------------------------------------+ | New York | new york | +-------------+------------------------------------+Be aware that if you apply BINARY, CAST(), or CONVERT() to an indexed column, MySQL may not be able to use the index efficiently.
Cast Operations on Spatial Types
As of MySQL 8.0.24, CAST() and CONVERT() support casting geometry values from one spatial type to another, for certain combinations of spatial types. The following list shows the permitted type combinations, where “MySQL extension” designates casts implemented in MySQL beyond those defined in the SQL/MM standard:
In spatial casts, GeometryCollection and GeomCollection are synonyms for the same result type. Some conditions apply to all spatial type casts, and some conditions apply only when the cast result is to have a particular spatial type. For information about terms such as “well-formed geometry,” see Section 11.4.4, “Geometry Well-Formedness and Validity”. These conditions apply to all spatial casts regardless of the result type:
When the cast result type is Point, these conditions apply:
When the cast result type is LineString, these conditions apply:
When the cast result type is Polygon, these conditions apply:
When the cast result type is MultiPoint, these conditions apply:
When the cast result type is MultiLineString, these conditions apply:
When the cast result type is MultiPolygon, these conditions apply:
When the cast result type is GeometryCollection, these conditions apply:
Other Uses for Cast OperationsThe cast functions are useful for creating a column with a specific type in a CREATE TABLE ... SELECT statement: mysql> CREATE TABLE new_table SELECT CAST('2000-01-01' AS DATE) AS c1; mysql> SHOW CREATE TABLE new_table\G *************************** 1. row *************************** Table: new_table Create Table: CREATE TABLE `new_table` ( `c1` date DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4The cast functions are useful for sorting ENUM columns in lexical order. Normally, sorting of ENUM columns occurs using the internal numeric values. Casting the values to CHAR results in a lexical sort: SELECT enum_col FROM tbl_name ORDER BY CAST(enum_col AS CHAR);CAST() also changes the result if you use it as part of a more complex expression such as CONCAT('Date: ',CAST(NOW() AS DATE)). For temporal values, there is little need to use CAST() to extract data in different formats. Instead, use a function such as EXTRACT(), DATE_FORMAT(), or TIME_FORMAT(). See Section 12.7, “Date and Time Functions”. To cast a string to a number, it normally suffices to use the string value in numeric context: mysql> SELECT 1+'1'; -> 2That is also true for hexadecimal and bit literals, which are binary strings by default: mysql> SELECT X'41', X'41'+0; -> 'A', 65 mysql> SELECT b'1100001', b'1100001'+0; -> 'a', 97A string used in an arithmetic operation is converted to a floating-point number during expression evaluation. A number used in string context is converted to a string: mysql> SELECT CONCAT('hello you ',2); -> 'hello you 2'For information about implicit conversion of numbers to strings, see Section 12.3, “Type Conversion in Expression Evaluation”. MySQL supports arithmetic with both signed and unsigned 64-bit values. For numeric operators (such as + or -) where one of the operands is an unsigned integer, the result is unsigned by default (see Section 12.6.1, “Arithmetic Operators”). 
To override this, use the SIGNED or UNSIGNED cast operator to cast a value to a signed or unsigned 64-bit integer, respectively. mysql> SELECT 1 - 2; -> -1 mysql> SELECT CAST(1 - 2 AS UNSIGNED); -> 18446744073709551615 mysql> SELECT CAST(CAST(1 - 2 AS UNSIGNED) AS SIGNED); -> -1If either operand is a floating-point value, the result is a floating-point value and is not affected by the preceding rule. (In this context, DECIMAL column values are regarded as floating-point values.) mysql> SELECT CAST(1 AS UNSIGNED) - 2.0; -> -1.0The SQL mode affects the result of conversion operations (see Section 5.1.11, “Server SQL Modes”). Examples:
Table 12.16 XML Functions
This section discusses XML and related functionality in MySQL. Two functions providing basic XPath 1.0 (XML Path Language, version 1.0) capabilities are available. Some basic information about XPath syntax and usage is provided later in this section; however, an in-depth discussion of these topics is beyond the scope of this manual, and you should refer to the XML Path Language (XPath) 1.0 standard for definitive information. A useful resource for those new to XPath or who desire a refresher in the basics is the Zvon.org XPath Tutorial, which is available in several languages.
Note These functions remain under development. We continue to improve these and other aspects of XML and XPath functionality in MySQL 8.0 and onwards. You may discuss these, ask questions about them, and obtain help from other users with them in the MySQL XML User Forum. XPath expressions used with these functions support user variables and local stored program variables. User variables are weakly checked; variables local to stored programs are strongly checked (see also Bug #26518):
Expressions containing user variables or variables local to stored programs must otherwise (except for notation) conform to the rules for XPath expressions containing variables as given in the XPath 1.0 specification.
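As a brief illustration of a user variable inside an XPath expression, consider the following sketch. The `$@name` notation for user variables is the one documented for these functions; the sample XML and the variable names are assumptions made for this example:

```sql
-- Select the Nth <b> element, where N comes from a user variable.
SET @xml = '<a><b>X</b><b>Y</b></a>';
SET @i = 2;
SELECT ExtractValue(@xml, '//b[$@i]') AS second_b;  -- Y
```

Because user variables are only weakly checked, a misspelled variable name is not reported as an error, so it is worth verifying the result when an expression unexpectedly matches nothing.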
Note A user variable used to store an XPath expression is treated as an empty string. Because of this, it is not possible to store an XPath expression as a user variable. (Bug #32911)
Note An in-depth discussion of XPath syntax and usage is beyond the scope of this manual. Please see the XML Path Language (XPath) 1.0 specification for definitive information. A useful resource for those new to XPath or who want a refresher in the basics is the Zvon.org XPath Tutorial, which is available in several languages. Descriptions and examples of some basic XPath expressions follow:
XPath Limitations. The XPath syntax supported by these functions is currently subject to the following limitations:
XPath expressions passed as arguments to ExtractValue() and UpdateXML() may contain the colon character (:) in element selectors, which enables their use with markup employing XML namespaces notation. For example: mysql> SET @xml = '<a>111<b:c>222<d>333</d><e:f>444</e:f></b:c></a>'; Query OK, 0 rows affected (0.00 sec) mysql> SELECT ExtractValue(@xml, '//e:f'); +-----------------------------+ | ExtractValue(@xml, '//e:f') | +-----------------------------+ | 444 | +-----------------------------+ 1 row in set (0.00 sec) mysql> SELECT UpdateXML(@xml, '//b:c', '<g:h>555</g:h>'); +--------------------------------------------+ | UpdateXML(@xml, '//b:c', '<g:h>555</g:h>') | +--------------------------------------------+ | <a>111<g:h>555</g:h></a> | +--------------------------------------------+ 1 row in set (0.00 sec)This is similar in some respects to what is permitted by Apache Xalan and some other parsers, and is much simpler than requiring namespace declarations or the use of the namespace-uri() and local-name() functions. Error handling. For both ExtractValue() and UpdateXML(), the XPath locator used must be valid and the XML to be searched must consist of elements which are properly nested and closed. If the locator is invalid, an error is generated: mysql> SELECT ExtractValue('<a>c</a><b/>', '/&a'); ERROR 1105 (HY000): XPATH syntax error: '&a'If xml_frag does not consist of elements which are properly nested and closed, NULL is returned and a warning is generated, as shown in this example: mysql> SELECT ExtractValue('<a>c</a><b', '//a'); +-----------------------------------+ | ExtractValue('<a>c</a><b', '//a') | +-----------------------------------+ | NULL | +-----------------------------------+ 1 row in set, 1 warning (0.00 sec) mysql> SHOW WARNINGS\G *************************** 1. 
row *************************** Level: Warning Code: 1525 Message: Incorrect XML value: 'parse error at line 1 pos 11: END-OF-INPUT unexpected ('>' wanted)' 1 row in set (0.00 sec) mysql> SELECT ExtractValue('<a>c</a><b/>', '//a'); +-------------------------------------+ | ExtractValue('<a>c</a><b/>', '//a') | +-------------------------------------+ | c | +-------------------------------------+ 1 row in set (0.00 sec)
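Because a malformed fragment yields NULL rather than an error, an application can test for NULL before using the result. This is only a sketch; the variable name @frag and the status strings are hypothetical:

```sql
-- A fragment that is not properly closed makes ExtractValue() return NULL
-- (with a warning), which can be mapped to an explicit status.
SET @frag = '<a>c</a><b';
SELECT IF(ExtractValue(@frag, '//a') IS NULL,
          'malformed XML',
          'ok') AS status;
```

An XPath expression that simply matches nothing returns an empty string, not NULL, so this test separates the two cases.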
Important The replacement XML used as the third argument to UpdateXML() is not checked to determine whether it consists solely of elements which are properly nested and closed.
XPath Injection. Code injection occurs when malicious code is introduced into the system to gain unauthorized access to privileges and data. It is based on exploiting assumptions made by developers about the type and content of data input from users. XPath is no exception in this regard. A common scenario in which this can happen is the case of an application which handles authorization by matching the combination of a login name and password with those found in an XML file, using an XPath expression like this one:
//user[login/text()='neapolitan' and password/text()='1c3cr34m']/attribute::id
This is the XPath equivalent of an SQL statement like this one:
SELECT id FROM users WHERE login='neapolitan' AND password='1c3cr34m';
A PHP application employing XPath might handle the login process like this:
<?php
  $file = "users.xml";
  $login = $_POST["login"];
  $password = $_POST["password"];
  $xpath = "//user[login/text()=$login and password/text()=$password]/attribute::id";
  if( file_exists($file) ) {
    $xml = simplexml_load_file($file);
    if($result = $xml->xpath($xpath))
      echo "You are now logged in as user $result[0].";
    else
      echo "Invalid login name or password.";
  }
  else
    exit("Failed to open $file.");
?>
No checks are performed on the input.
This means that a malevolent user can “short-circuit” the test by entering ' or 1=1 for both the login name and password, resulting in $xpath being evaluated as shown here: //user[login/text()='' or 1=1 and password/text()='' or 1=1]/attribute::idSince the expression inside the square brackets always evaluates as true, it is effectively the same as this one, which matches the id attribute of every user element in the XML document: //user/attribute::idOne way in which this particular attack can be circumvented is simply by quoting the variable names to be interpolated in the definition of $xpath, forcing the values passed from a Web form to be converted to strings: $xpath = "//user[login/text()='$login' and password/text()='$password']/attribute::id";This is the same strategy that is often recommended for preventing SQL injection attacks. In general, the practices you should follow for preventing XPath injection attacks are the same as for preventing SQL injection:
Just as SQL injection attacks can be used to obtain information about database schemas, so can XPath injection be used to traverse XML files to uncover their structure, as discussed in Amit Klein's paper Blind XPath Injection (PDF file, 46KB). It is also important to check the output being sent back to the client. Consider what can happen when we use the MySQL ExtractValue() function:
mysql> SELECT ExtractValue(
    ->     LOAD_FILE('users.xml'),
    ->     '//user[login/text()="" or 1=1 and password/text()="" or 1=1]/attribute::id'
    -> ) AS id;
+-------------------------------+
| id                            |
+-------------------------------+
| 00327 13579 02403 42354 28570 |
+-------------------------------+
1 row in set (0.01 sec)
Because ExtractValue() returns multiple matches as a single space-delimited string, this injection attack provides every valid ID contained within users.xml to the user as a single row of output. As an extra safeguard, you should also test output before returning it to the user. Here is a simple example, which stores the result in a user variable and then checks it:
mysql> SET @id = ExtractValue(
    ->     LOAD_FILE('users.xml'),
    ->     '//user[login/text()="" or 1=1 and password/text()="" or 1=1]/attribute::id'
    -> );
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT IF(
    ->     INSTR(@id, ' ') = 0,
    ->     @id,
    ->     'Unable to retrieve user ID')
    -> AS singleID;
+----------------------------+
| singleID                   |
+----------------------------+
| Unable to retrieve user ID |
+----------------------------+
1 row in set (0.00 sec)
In general, the guidelines for returning data to users securely are the same as for accepting user input. These can be summed up as:
12.13 Bit Functions and Operators
Table 12.17 Bit Functions and Operators
Bit functions and operators comprise BIT_COUNT(), BIT_AND(), BIT_OR(), BIT_XOR(), &, |, ^, ~, <<, and >>. (The BIT_AND(), BIT_OR(), and BIT_XOR() aggregate functions are described in Section 12.20.1, “Aggregate Function Descriptions”.) Prior to MySQL 8.0, bit functions and operators required BIGINT (64-bit integer) arguments and returned BIGINT values, so they had a maximum range of 64 bits. Non-BIGINT arguments were converted to BIGINT prior to performing the operation and truncation could occur. In MySQL 8.0, bit functions and operators permit binary string type arguments (BINARY, VARBINARY, and the BLOB types) and return a value of like type, which enables them to take arguments and produce return values larger than 64 bits. Nonbinary string arguments are converted to BIGINT and processed as such, as before. An implication of this change in behavior is that bit operations on binary string arguments might produce a different result in MySQL 8.0 than in 5.7. For information about how to prepare in MySQL 5.7 for potential incompatibilities between MySQL 5.7 and 8.0, see Bit Functions and Operators, in MySQL 5.7 Reference Manual.
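To make the difference from the 64-bit limit concrete, the following sketch applies a bitwise XOR to two 128-bit values. The operand values and the alias are chosen for illustration, and the `_binary` introducer is used so that the hexadecimal literals are evaluated as binary strings rather than numbers:

```sql
-- 16-byte (128-bit) operands; the result is also a 16-byte binary string.
SELECT HEX(_binary X'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF'
           ^ X'0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F') AS xor_128;
-- xor_128: F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0
```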
Bit Operations Prior to MySQL 8.0
Bit operations prior to MySQL 8.0 handle only unsigned 64-bit integer argument and result values (that is, unsigned BIGINT values). Conversion of arguments of other types to BIGINT occurs as necessary. Examples:
Bit Operations in MySQL 8.0
MySQL 8.0 extends bit operations to handle binary string arguments directly (without conversion) and produce binary string results. (Arguments that are not integers or binary strings are still converted to integers, as before.) This extension enhances bit operations in the following ways:
For example, consider UUID values and IPv6 addresses, which have human-readable text formats like this: UUID: 6ccd780c-baba-1026-9564-5b8c656024db IPv6: fe80::219:d1ff:fe91:1a72It is cumbersome to operate on text strings in those formats. An alternative is convert them to fixed-length binary strings without delimiters. UUID_TO_BIN() and INET6_ATON() each produce a value of data type BINARY(16), a binary string 16 bytes (128 bits) long. The following statements illustrate this (HEX() is used to produce displayable values): mysql> SELECT HEX(UUID_TO_BIN('6ccd780c-baba-1026-9564-5b8c656024db')); +----------------------------------------------------------+ | HEX(UUID_TO_BIN('6ccd780c-baba-1026-9564-5b8c656024db')) | +----------------------------------------------------------+ | 6CCD780CBABA102695645B8C656024DB | +----------------------------------------------------------+ mysql> SELECT HEX(INET6_ATON('fe80::219:d1ff:fe91:1a72')); +---------------------------------------------+ | HEX(INET6_ATON('fe80::219:d1ff:fe91:1a72')) | +---------------------------------------------+ | FE800000000000000219D1FFFE911A72 | +---------------------------------------------+Those binary values are easily manipulable with bit operations to perform actions such as extracting the timestamp from UUID values, or extracting the network and host parts of IPv6 addresses. (For examples, see later in this discussion.) Arguments that count as binary strings include column values, routine parameters, local variables, and user-defined variables that have a binary string type: BINARY, VARBINARY, or one of the BLOB types. What about hexadecimal literals and bit literals? Recall that those are binary strings by default in MySQL, but numbers in numeric context. How are they handled for bit operations in MySQL 8.0? Does MySQL continue to evaluate them in numeric context, as is done prior to MySQL 8.0? 
Or do bit operations evaluate them as binary strings, now that binary strings can be handled “natively” without conversion? Answer: It has been common to specify arguments to bit operations using hexadecimal literals or bit literals with the intent that they represent numbers, so MySQL continues to evaluate bit operations in numeric context when all bit arguments are hexadecimal or bit literals, for backward compatibility. If you require evaluation as binary strings instead, that is easily accomplished: Use the _binary introducer for at least one literal.
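A minimal pair of statements showing the two evaluation contexts might look like this (the operands are the same ones used elsewhere in this section):

```sql
-- All arguments are plain hexadecimal literals: numeric context.
SELECT X'40' | X'01';           -- 65

-- One literal carries the _binary introducer: binary-string context.
SELECT _binary X'40' | X'01';   -- 'A' (the single byte 0x41)
```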
Although the bit operations in both statements produce a result with a numeric value of 65, the second statement operates in binary-string context, for which 65 is ASCII A. In numeric evaluation context, permitted values of hexadecimal literal and bit literal arguments have a maximum of 64 bits, as do results. By contrast, in binary-string evaluation context, permitted arguments (and results) can exceed 64 bits: mysql> SELECT _binary X'4040404040404040' | X'0102030405060708'; +---------------------------------------------------+ | _binary X'4040404040404040' | X'0102030405060708' | +---------------------------------------------------+ | ABCDEFGH | +---------------------------------------------------+There are several ways to refer to a hexadecimal literal or bit literal in a bit operation to cause binary-string evaluation: _binary literal BINARY literal CAST(literal AS BINARY)Another way to produce binary-string evaluation of hexadecimal literals or bit literals is to assign them to user-defined variables, which results in variables that have a binary string type: mysql> SET @v1 = X'40', @v2 = X'01', @v3 = b'11110001', @v4 = b'01001111'; mysql> SELECT @v1 | @v2, @v3 & @v4; +-----------+-----------+ | @v1 | @v2 | @v3 & @v4 | +-----------+-----------+ | A | A | +-----------+-----------+In binary-string context, bitwise operation arguments must have the same length or an ER_INVALID_BITWISE_OPERANDS_SIZE error occurs: mysql> SELECT _binary X'40' | X'0001'; ERROR 3513 (HY000): Binary operands of bitwise operators must be of equal lengthTo satisfy the equal-length requirement, pad the shorter value with leading zero digits or, if the longer value begins with leading zero digits and a shorter result value is acceptable, strip them: mysql> SELECT _binary X'0040' | X'0001'; +---------------------------+ | _binary X'0040' | X'0001' | +---------------------------+ | A | +---------------------------+ mysql> SELECT _binary X'40' | X'01'; +-----------------------+ | _binary 
X'40' | X'01' | +-----------------------+ | A | +-----------------------+Padding or stripping can also be accomplished using functions such as LPAD(), RPAD(), SUBSTR(), or CAST(). In such cases, the expression arguments are no longer all literals and _binary becomes unnecessary. Examples: mysql> SELECT LPAD(X'40', 2, X'00') | X'0001'; +---------------------------------+ | LPAD(X'40', 2, X'00') | X'0001' | +---------------------------------+ | A | +---------------------------------+ mysql> SELECT X'40' | SUBSTR(X'0001', 2, 1); +-------------------------------+ | X'40' | SUBSTR(X'0001', 2, 1) | +-------------------------------+ | A | +-------------------------------+
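When choosing a padding function, note that the side on which the padding is added changes the result. The following sketch contrasts LPAD() with CAST(... AS BINARY(N)), which pads on the right with 0x00 bytes; the aliases are hypothetical and HEX() is used only to make the bytes visible:

```sql
SELECT HEX(LPAD(X'40', 2, X'00') | X'0001')      AS left_padded,   -- 0041
       HEX(CAST(X'40' AS BINARY(2)) | X'0001')   AS right_padded;  -- 4001
```

Left padding preserves the numeric value of the shorter operand, whereas right padding moves its bits to the high-order end.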
Binary String Bit-Operation ExamplesThe following example illustrates use of bit operations to extract parts of a UUID value, in this case, the timestamp and IEEE 802 node number. This technique requires bitmasks for each extracted part. Convert the text UUID to the corresponding 16-byte binary value so that it can be manipulated using bit operations in binary-string context: mysql> SET @uuid = UUID_TO_BIN('6ccd780c-baba-1026-9564-5b8c656024db'); mysql> SELECT HEX(@uuid); +----------------------------------+ | HEX(@uuid) | +----------------------------------+ | 6CCD780CBABA102695645B8C656024DB | +----------------------------------+Construct bitmasks for the timestamp and node number parts of the value. The timestamp comprises the first three parts (64 bits, bits 0 to 63) and the node number is the last part (48 bits, bits 80 to 127): mysql> SET @ts_mask = CAST(X'FFFFFFFFFFFFFFFF' AS BINARY(16)); mysql> SET @node_mask = CAST(X'FFFFFFFFFFFF' AS BINARY(16)) >> 80; mysql> SELECT HEX(@ts_mask); +----------------------------------+ | HEX(@ts_mask) | +----------------------------------+ | FFFFFFFFFFFFFFFF0000000000000000 | +----------------------------------+ mysql> SELECT HEX(@node_mask); +----------------------------------+ | HEX(@node_mask) | +----------------------------------+ | 00000000000000000000FFFFFFFFFFFF | +----------------------------------+The CAST(... AS BINARY(16)) function is used here because the masks must be the same length as the UUID value against which they are applied. 
The same result can be produced using other functions to pad the masks to the required length: SET @ts_mask= RPAD(X'FFFFFFFFFFFFFFFF' , 16, X'00'); SET @node_mask = LPAD(X'FFFFFFFFFFFF', 16, X'00') ;Use the masks to extract the timestamp and node number parts: mysql> SELECT HEX(@uuid & @ts_mask) AS 'timestamp part'; +----------------------------------+ | timestamp part | +----------------------------------+ | 6CCD780CBABA10260000000000000000 | +----------------------------------+ mysql> SELECT HEX(@uuid & @node_mask) AS 'node part'; +----------------------------------+ | node part | +----------------------------------+ | 000000000000000000005B8C656024DB | +----------------------------------+The preceding example uses these bit operations: right shift (>>) and bitwise AND (&).
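If the extracted part is needed at the low-order end of the value, a shift can follow the mask. This is a sketch only, reusing the UUID and timestamp mask from the example above; the alias is hypothetical:

```sql
SET @uuid = UUID_TO_BIN('6ccd780c-baba-1026-9564-5b8c656024db');
SET @ts_mask = CAST(X'FFFFFFFFFFFFFFFF' AS BINARY(16));

-- Mask off the timestamp part, then shift it right by 64 bits.
SELECT HEX((@uuid & @ts_mask) >> 64) AS ts_right_aligned;
-- ts_right_aligned: 00000000000000006CCD780CBABA1026
```

The result keeps the 16-byte length of the operand; only the bit positions change.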
Note UUID_TO_BIN() takes a flag that causes some bit rearrangement in the resulting binary UUID value. If you use that flag, modify the extraction masks accordingly. The next example uses bit operations to extract the network and host parts of an IPv6 address. Suppose that the network part has a length of 80 bits. Then the host part has a length of 128 − 80 = 48 bits. To extract the network and host parts of the address, convert it to a binary string, then use bit operations in binary-string context. Convert the text IPv6 address to the corresponding binary string: mysql> SET @ip = INET6_ATON('fe80::219:d1ff:fe91:1a72');Define the network length in bits: mysql> SET @net_len = 80;Construct network and host masks by shifting the all-ones address left or right. To do this, begin with the address ::, which is shorthand for all zeros, as you can see by converting it to a binary string like this: mysql> SELECT HEX(INET6_ATON('::')) AS 'all zeros'; +----------------------------------+ | all zeros | +----------------------------------+ | 00000000000000000000000000000000 | +----------------------------------+To produce the complementary value (all ones), use the ~ operator to invert the bits: mysql> SELECT HEX(~INET6_ATON('::')) AS 'all ones'; +----------------------------------+ | all ones | +----------------------------------+ | FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF | +----------------------------------+Shift the all-ones value left or right to produce the network and host masks: mysql> SET @net_mask = ~INET6_ATON('::') << (128 - @net_len); mysql> SET @host_mask = ~INET6_ATON('::') >> @net_len;Display the masks to verify that they cover the correct parts of the address: mysql> SELECT INET6_NTOA(@net_mask) AS 'network mask'; +----------------------------+ | network mask | +----------------------------+ | ffff:ffff:ffff:ffff:ffff:: | +----------------------------+ mysql> SELECT INET6_NTOA(@host_mask) AS 'host mask'; +------------------------+ | host mask | 
+------------------------+
| ::ffff:255.255.255.255 |
+------------------------+
Extract and display the network and host parts of the address:
mysql> SET @net_part = @ip & @net_mask;
mysql> SET @host_part = @ip & @host_mask;
mysql> SELECT INET6_NTOA(@net_part) AS 'network part';
+-----------------+
| network part    |
+-----------------+
| fe80::219:0:0:0 |
+-----------------+
mysql> SELECT INET6_NTOA(@host_part) AS 'host part';
+------------------+
| host part        |
+------------------+
| ::d1ff:fe91:1a72 |
+------------------+
The preceding example uses these bit operations: complement (~), left shift (<<), right shift (>>), and bitwise AND (&). The remaining discussion provides details on argument handling for each group of bit operations, more information about literal-value handling in bit operations, and potential incompatibilities between MySQL 8.0 and older MySQL versions.
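The same network mask can also be used to test whether an address belongs to a given network. The following is a sketch under the same 80-bit prefix assumption as the IPv6 example above; the second address and the aliases are made up for illustration:

```sql
SET @net_len = 80;
SET @net_mask = ~INET6_ATON('::') << (128 - @net_len);
SET @network = INET6_ATON('fe80::219:0:0:0');

-- Compare the masked address with the network address.
SELECT (INET6_ATON('fe80::219:d1ff:fe91:1a72') & @net_mask) = @network
         AS same_network,       -- 1
       (INET6_ATON('fe80::21a:d1ff:fe91:1a72') & @net_mask) = @network
         AS other_network;      -- 0
```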
Bitwise AND, OR, and XOR Operations
For &, |, and ^ bit operations, the result type depends on whether the arguments are evaluated as binary strings or numbers:
Examples of numeric evaluation: mysql> SELECT 64 | 1, X'40' | X'01'; +--------+---------------+ | 64 | 1 | X'40' | X'01' | +--------+---------------+ | 65 | 65 | +--------+---------------+Examples of binary-string evaluation: mysql> SELECT _binary X'40' | X'01'; +-----------------------+ | _binary X'40' | X'01' | +-----------------------+ | A | +-----------------------+ mysql> SET @var1 = X'40', @var2 = X'01'; mysql> SELECT @var1 | @var2; +---------------+ | @var1 | @var2 | +---------------+ | A | +---------------+
Bitwise Complement and Shift Operations
For ~, <<, and >> bit operations, the result type depends on whether the bit argument is evaluated as a binary string or number:
For shift operations, bits shifted off the end of the value are lost without warning, regardless of the argument type. In particular, if the shift count is greater or equal to the number of bits in the bit argument, all bits in the result are 0. Examples of numeric evaluation: mysql> SELECT ~0, 64 << 2, X'40' << 2; +----------------------+---------+------------+ | ~0 | 64 << 2 | X'40' << 2 | +----------------------+---------+------------+ | 18446744073709551615 | 256 | 256 | +----------------------+---------+------------+Examples of binary-string evaluation: mysql> SELECT HEX(_binary X'1111000022220000' >> 16); +----------------------------------------+ | HEX(_binary X'1111000022220000' >> 16) | +----------------------------------------+ | 0000111100002222 | +----------------------------------------+ mysql> SELECT HEX(_binary X'1111000022220000' << 16); +----------------------------------------+ | HEX(_binary X'1111000022220000' << 16) | +----------------------------------------+ | 0000222200000000 | +----------------------------------------+ mysql> SET @var1 = X'F0F0F0F0'; mysql> SELECT HEX(~@var1); +-------------+ | HEX(~@var1) | +-------------+ | 0F0F0F0F | +-------------+
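The rule about shift counts that equal or exceed the width of the argument can be checked with a short sketch (the operands and aliases are chosen for illustration):

```sql
-- Shifting by at least the full width clears every bit, in either context.
SELECT HEX(_binary X'FFFF' << 16) AS binary_shift,   -- 0000 (still 2 bytes)
       1 << 64                    AS numeric_shift;  -- 0
```

In binary-string context the result keeps the length of the argument, so the all-zero value here is two bytes long.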
The BIT_COUNT() function always returns an unsigned 64-bit integer, or NULL if the argument is NULL. mysql> SELECT BIT_COUNT(127); +----------------+ | BIT_COUNT(127) | +----------------+ | 7 | +----------------+ mysql> SELECT BIT_COUNT(b'010101'), BIT_COUNT(_binary b'010101'); +----------------------+------------------------------+ | BIT_COUNT(b'010101') | BIT_COUNT(_binary b'010101') | +----------------------+------------------------------+ | 3 | 3 | +----------------------+------------------------------+
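Because BIT_COUNT() accepts binary strings longer than 64 bits, it can, for example, recover the prefix length from an IPv6 network mask. This is a sketch that assumes a mask built by shifting the all-ones address, as in the IPv6 example earlier in this section; the alias is hypothetical:

```sql
-- 128-bit all-ones value shifted left by 48 leaves 80 bits set.
SET @net_mask = ~INET6_ATON('::') << 48;
SELECT BIT_COUNT(@net_mask) AS prefix_len;  -- 80
```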
BIT_AND(), BIT_OR(), and BIT_XOR() Operations
For the BIT_AND(), BIT_OR(), and BIT_XOR() bit functions, the result type depends on whether the function argument values are evaluated as binary strings or numbers:
NULL values do not affect the result unless all values are NULL. In that case, the result is a neutral value having the same length as the length of the argument values (all bits 1 for BIT_AND(), all bits 0 for BIT_OR() and BIT_XOR()). Example:
mysql> CREATE TABLE t (group_id INT, a VARBINARY(6));
mysql> INSERT INTO t VALUES (1, NULL);
mysql> INSERT INTO t VALUES (1, NULL);
mysql> INSERT INTO t VALUES (2, NULL);
mysql> INSERT INTO t VALUES (2, X'1234');
mysql> INSERT INTO t VALUES (2, X'FF34');
mysql> SELECT HEX(BIT_AND(a)), HEX(BIT_OR(a)), HEX(BIT_XOR(a))
       FROM t GROUP BY group_id;
+-----------------+----------------+-----------------+
| HEX(BIT_AND(a)) | HEX(BIT_OR(a)) | HEX(BIT_XOR(a)) |
+-----------------+----------------+-----------------+
| FFFFFFFFFFFF    | 000000000000   | 000000000000    |
| 1234            | FF34           | ED00            |
+-----------------+----------------+-----------------+
Special Handling of Hexadecimal Literals, Bit Literals, and NULL Literals
For backward compatibility, MySQL 8.0 evaluates bit operations in numeric context when all bit arguments are hexadecimal literals, bit literals, or NULL literals. That is, bit operations on binary-string bit arguments do not use binary-string evaluation if all bit arguments are unadorned hexadecimal literals, bit literals, or NULL literals. (This does not apply to such literals if they are written with a _binary introducer, BINARY operator, or other way of specifying them explicitly as binary strings.) The literal handling just described is the same as prior to MySQL 8.0. Examples:
In MySQL 8.0, you can cause those operations to evaluate the arguments in binary-string context by indicating explicitly that at least one argument is a binary string: _binary b'0001' | b'0010' _binary X'0008' << 8 BINARY NULL & NULL BINARY NULL >> 4The result of the last two expressions is NULL, just as without the BINARY operator, but the data type of the result is a binary string type rather than an integer type.
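The effect of the introducer on one of these expressions can be sketched as follows (HEX() is applied to the binary-string result only to make it displayable; the aliases are hypothetical):

```sql
SELECT X'0008' << 8              AS numeric_result,  -- 2048
       HEX(_binary X'0008' << 8) AS binary_result;   -- 0800
```

Both results describe the same bit pattern, but the first is a BIGINT and the second is a 2-byte binary string.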
Bit-Operation Incompatibilities with MySQL 5.7
Because bit operations can handle binary string arguments natively in MySQL 8.0, some expressions produce a different result in MySQL 8.0 than in 5.7. The five problematic expression types to watch out for are:
nonliteral_binary { & | ^ } binary
binary { & | ^ } nonliteral_binary
nonliteral_binary { << >> } anything
~ nonliteral_binary
AGGR_BIT_FUNC(nonliteral_binary)
Those expressions return BIGINT in MySQL 5.7, binary string in 8.0. Explanation of notation:
For information about how to prepare in MySQL 5.7 for potential incompatibilities between MySQL 5.7 and 8.0, see Bit Functions and Operators, in MySQL 5.7 Reference Manual. The following list describes available bit functions and operators:
12.14 Encryption and Compression Functions

Many encryption and compression functions return strings for which the result might contain arbitrary byte values. If you want to store these results, use a column with a VARBINARY or BLOB binary string data type. This avoids potential problems with trailing space removal or character set conversion that would change data values, such as may occur if you use a nonbinary string data type (CHAR, VARCHAR, TEXT).

Some encryption functions return strings of ASCII characters: MD5(), SHA(), SHA1(), SHA2(), STATEMENT_DIGEST(), STATEMENT_DIGEST_TEXT(). Their return value is a string that has a character set and collation determined by the character_set_connection and collation_connection system variables. This is a nonbinary string unless the character set is binary.

If an application stores values from a function such as MD5() or SHA1() that returns a string of hex digits, more efficient storage and comparisons can be obtained by converting the hex representation to binary using UNHEX() and storing the result in a BINARY(N) column. Each pair of hexadecimal digits requires one byte in binary form, so the value of N depends on the length of the hex string. N is 16 for an MD5() value and 20 for a SHA1() value. For SHA2(), N ranges from 28 to 64 depending on the argument specifying the desired bit length of the result (224 to 512 bits).

The size penalty for storing the hex string in a CHAR column is at least two times, up to eight times if the value is stored in a column that uses the utf8mb4 character set (where each character uses 4 bytes). Storing the string also results in slower comparisons because of the larger values and the need to take character set collation rules into account.

Suppose that an application stores MD5() string values in a CHAR(32) column:

CREATE TABLE md5_tbl (md5_val CHAR(32), ...);
INSERT INTO md5_tbl (md5_val, ...) VALUES(MD5('abcdef'), ...);

To convert hex strings to more compact form, modify the application to use UNHEX() and BINARY(16) instead as follows:

CREATE TABLE md5_tbl (md5_val BINARY(16), ...);
INSERT INTO md5_tbl (md5_val, ...) VALUES(UNHEX(MD5('abcdef')), ...);

Applications should be prepared to handle the very rare case that a hashing function produces the same value for two different input values. One way to make collisions detectable is to make the hash column a primary key.
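The relationship between the hex form and the binary form of a digest can be checked with a short Python sketch. It uses the standard hashlib module rather than the MySQL functions themselves, with `bytes.fromhex()` standing in for UNHEX(); the helper name `storage_sizes` is hypothetical. The sketch prints the CHAR length needed for the hex string next to the BINARY length needed after conversion.

```python
import hashlib

data = b"abcdef"

# Hex digests, comparable in form to what MD5(), SHA1(), and SHA2() return.
digests = {
    "MD5": hashlib.md5(data).hexdigest(),
    "SHA1": hashlib.sha1(data).hexdigest(),
    "SHA2-224": hashlib.sha224(data).hexdigest(),
    "SHA2-256": hashlib.sha256(data).hexdigest(),
    "SHA2-384": hashlib.sha384(data).hexdigest(),
    "SHA2-512": hashlib.sha512(data).hexdigest(),
}


def storage_sizes(hex_digest: str) -> tuple:
    """Return (hex length, binary length) for a hex digest string."""
    raw = bytes.fromhex(hex_digest)  # plays the role of UNHEX()
    return len(hex_digest), len(raw)


for name, hex_digest in digests.items():
    hex_len, bin_len = storage_sizes(hex_digest)
    print(f"{name}: CHAR({hex_len}) vs BINARY({bin_len})")
# MD5: CHAR(32) vs BINARY(16)
# SHA1: CHAR(40) vs BINARY(20)
# SHA2-224: CHAR(56) vs BINARY(28)
# SHA2-256: CHAR(64) vs BINARY(32)
# SHA2-384: CHAR(96) vs BINARY(48)
# SHA2-512: CHAR(128) vs BINARY(64)
```

Each pair of hex digits collapses to one byte, so the binary column is always half the length of the hex string before any character set overhead is counted.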
Note

Exploits for the MD5 and SHA-1 algorithms have become known. You may wish to consider using another one-way encryption function described in this section instead, such as SHA2().
Caution

Passwords or other sensitive values supplied as arguments to encryption functions are sent as cleartext to the MySQL server unless an SSL connection is used. Also, such values appear in any MySQL logs to which they are written. To avoid these types of exposure, applications can encrypt sensitive values on the client side before sending them to the server. The same considerations apply to encryption keys. To avoid exposing these, applications can use stored procedures to encrypt and decrypt values on the server side.
GET_LOCK(str,timeout)

Tries to obtain a lock with a name given by the string str, using a timeout of timeout seconds. A negative timeout value means infinite timeout. The lock is exclusive. While held by one session, other sessions cannot obtain a lock of the same name.

Returns 1 if the lock was obtained successfully, 0 if the attempt timed out (for example, because another client has previously locked the name), or NULL if an error occurred (such as running out of memory or the thread was killed with mysqladmin kill).

A lock obtained with GET_LOCK() is released explicitly by executing RELEASE_LOCK() or implicitly when your session terminates (either normally or abnormally). Locks obtained with GET_LOCK() are not released when transactions commit or roll back.

GET_LOCK() is implemented using the metadata locking (MDL) subsystem. Multiple simultaneous locks can be acquired and GET_LOCK() does not release any existing locks. For example, suppose that you execute these statements:

SELECT GET_LOCK('lock1',10);
SELECT GET_LOCK('lock2',10);
SELECT RELEASE_LOCK('lock2');
SELECT RELEASE_LOCK('lock1');

The second GET_LOCK() acquires a second lock and both RELEASE_LOCK() calls return 1 (success).

It is even possible for a given session to acquire multiple locks for the same name. Other sessions cannot acquire a lock with that name until the acquiring session releases all its locks for the name.

Uniquely named locks acquired with GET_LOCK() appear in the Performance Schema metadata_locks table. The OBJECT_TYPE column says USER LEVEL LOCK and the OBJECT_NAME column indicates the lock name. In the case that multiple locks are acquired for the same name, only the first lock for the name registers a row in the metadata_locks table. Subsequent locks for the name increment a counter in the lock but do not acquire additional metadata locks. The metadata_locks row for the lock is deleted when the last lock instance on the name is released.
The capability of acquiring multiple locks means there is the possibility of deadlock among clients. When this happens, the server chooses a caller and terminates its lock-acquisition request with an ER_USER_LOCK_DEADLOCK error. This error does not cause transactions to roll back.

MySQL enforces a maximum length on lock names of 64 characters.

GET_LOCK() can be used to implement application locks or to simulate record locks. Names are locked on a server-wide basis. If a name has been locked within one session, GET_LOCK() blocks any request by another session for a lock with the same name. This enables clients that agree on a given lock name to use the name to perform cooperative advisory locking. But be aware that it also enables a client that is not among the set of cooperating clients to lock a name, either inadvertently or deliberately, and thus prevent any of the cooperating clients from locking that name. One way to reduce the likelihood of this is to use lock names that are database-specific or application-specific. For example, use lock names of the form db_name.str or app_name.str.

If multiple clients are waiting for a lock, the order in which they acquire it is undefined. Applications should not assume that clients acquire the lock in the same order that they issued the lock requests.

GET_LOCK() is unsafe for statement-based replication. A warning is logged if you use this function when binlog_format is set to STATEMENT.

Since GET_LOCK() establishes a lock only on a single mysqld, it is not suitable for use with NDB Cluster, which has no way of enforcing an SQL lock across multiple MySQL servers. See Section 23.2.7.10, “Limitations Relating to Multiple NDB Cluster Nodes”, for more information.
With the capability of acquiring multiple named locks, it is possible for a single statement to acquire a large number of locks. For example:

INSERT INTO ... SELECT GET_LOCK(t1.col_name) FROM t1;

These types of statements may have certain adverse effects. For example, if the statement fails part way through and rolls back, locks acquired up to the point of failure still exist. If the intent is for there to be a correspondence between rows inserted and locks acquired, that intent is not satisfied. Also, if it is important that locks are granted in a certain order, be aware that result set order may differ depending on which execution plan the optimizer chooses. For these reasons, it may be best to limit applications to a single lock-acquisition call per statement.

A different locking interface is available as either a plugin service or a set of loadable functions. This interface provides lock namespaces and distinct read and write locks, unlike the interface provided by GET_LOCK() and related functions. For details, see Section 5.6.9.1, “The Locking Service”.

IS_FREE_LOCK(str)

Checks whether the lock named str is free to use (that is, not locked). Returns 1 if the lock is free (no one is using the lock), 0 if the lock is in use, and NULL if an error occurs (such as an incorrect argument).

This function is unsafe for statement-based replication. A warning is logged if you use this function when binlog_format is set to STATEMENT.

IS_USED_LOCK(str)

Checks whether the lock named str is in use (that is, locked). If so, it returns the connection identifier of the client session that holds the lock. Otherwise, it returns NULL.

This function is unsafe for statement-based replication. A warning is logged if you use this function when binlog_format is set to STATEMENT.

RELEASE_ALL_LOCKS()

Releases all named locks held by the current session and returns the number of locks released (0 if there were none).

This function is unsafe for statement-based replication. A warning is logged if you use this function when binlog_format is set to STATEMENT.

RELEASE_LOCK(str)

Releases the lock named by the string str that was obtained with GET_LOCK(). Returns 1 if the lock was released, 0 if the lock was not established by this thread (in which case the lock is not released), and NULL if the named lock did not exist. The lock does not exist if it was never obtained by a call to GET_LOCK() or if it has previously been released.

The DO statement is convenient to use with RELEASE_LOCK(). See Section 13.2.3, “DO Statement”.

This function is unsafe for statement-based replication. A warning is logged if you use this function when binlog_format is set to STATEMENT.
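The return values of the locking functions can be summarized with a toy model. The Python sketch below is not how the server implements named locks: the class `NamedLocks`, its method names, and the integer session identifiers are all hypothetical. It also makes simplifying assumptions: a request for a lock held by another session returns 0 immediately instead of waiting for a timeout, errors are not modeled, and `release_all_locks` counts every acquisition of a name separately.

```python
from typing import Dict, Optional, Tuple


class NamedLocks:
    """Toy in-process model of named-lock return values (no blocking)."""

    MAX_NAME_LENGTH = 64  # maximum lock name length stated in the text

    def __init__(self) -> None:
        # lock name -> (owning session id, number of acquisitions)
        self._held: Dict[str, Tuple[int, int]] = {}

    def get_lock(self, session: int, name: str) -> int:
        if len(name) > self.MAX_NAME_LENGTH:
            raise ValueError("lock name longer than 64 characters")
        owner = self._held.get(name)
        if owner is None:
            self._held[name] = (session, 1)
            return 1
        if owner[0] == session:
            # Same session may lock the same name again; bump the counter.
            self._held[name] = (session, owner[1] + 1)
            return 1
        return 0  # held by another session: modeled as an immediate timeout

    def release_lock(self, session: int, name: str) -> Optional[int]:
        owner = self._held.get(name)
        if owner is None:
            return None  # the named lock does not exist
        if owner[0] != session:
            return 0  # not established by this session; not released
        if owner[1] == 1:
            del self._held[name]
        else:
            self._held[name] = (session, owner[1] - 1)
        return 1

    def release_all_locks(self, session: int) -> int:
        # Assumption of this model: each acquisition counts as one lock.
        mine = {n: c for n, (s, c) in self._held.items() if s == session}
        for name in mine:
            del self._held[name]
        return sum(mine.values())

    def is_free_lock(self, name: str) -> int:
        return 0 if name in self._held else 1

    def is_used_lock(self, name: str) -> Optional[int]:
        owner = self._held.get(name)
        return owner[0] if owner is not None else None


locks = NamedLocks()
print(locks.get_lock(1, "app_name.lock1"))      # 1
print(locks.get_lock(1, "app_name.lock1"))      # 1 (same session, same name)
print(locks.get_lock(2, "app_name.lock1"))      # 0 (held by session 1)
print(locks.release_lock(2, "app_name.lock1"))  # 0 (not the owner)
print(locks.release_lock(1, "app_name.lock1"))  # 1
print(locks.is_free_lock("app_name.lock1"))     # 0 (one acquisition remains)
print(locks.release_lock(1, "app_name.lock1"))  # 1
print(locks.release_lock(1, "app_name.lock1"))  # None (no such lock)
```

The model makes the two-level behavior visible: a name stays locked against other sessions until the owning session has released every acquisition of it.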