可以看到上面的SQL语句变成了相关子查询,通过EXPLAIN EXTENDED 和 SHOW WARNINGS命令,可以看到如下结果:
select `northwind`.`driver`.`driver_id` AS `driver_id` from `northwind`.`driver` where <in_optimizer>(`northwind`.`driver`.`driver_id`,<exists>(select 1 from `northwind`.`driver` where ((`northwind`.`driver`.`_create_date` > '2016-07-25 00:00:00') and (<cache>(`northwind`.`driver`.`driver_id`) = `northwind`.`driver`.`driver_id`))))
可以看出MySql优化器直接把IN子句转换成了EXISTS的相关子查询。下面这条相关IN子查询:
SELECT driver_id FROM driver where driver_id in (SELECT driver_id FROM user where user.uid = driver.driver_id);
查看SQL语句的执行计划:
就是相关子查询,通过EXPLAIN EXTENDED 和 SHOW WARNINGS命令,看到如下结果:
select `northwind`.`driver`.`driver_id` AS `driver_id` from `northwind`.`driver` where <in_optimizer>(`northwind`.`driver`.`driver_id`,<exists>(select 1 from `northwind`.`user` where ((`northwind`.`user`.`uid` = `northwind`.`driver`.`driver_id`) and (<cache>(`northwind`.`driver`.`driver_id`) = `northwind`.`driver`.`driver_id`))))
IN和EXISTS哪个快?
网上百度到很多认为IN和EXISTS效率一样是错误的文章。
如果查询的两个表大小相当,那么用in和exists差别不大。
如果两个表中一个较小,一个是大表,则子查询表大的用exists,子查询表小的用in:
例如:表A(小表),表B(大表)
1:
select * from A where cc in (select cc from B) 效率低,用到了A表上cc列的索引;
select * from A where exists(select cc from B where cc=A.cc) 效率高,用到了B表上cc列的索引。
相反的
2:
select * from B where cc in (select cc from A) 效率高,用到了B表上cc列的索引;
select * from B where exists(select cc from A where cc=B.cc) 效率低,用到了A表上cc列的索引。
总结上面的描述,个人认为其主要的原因在于对索引的使用。任何情况下,只要是大表的索引被使用,就可以使效率提高。
但是在编辑本文的时候,多次测试,却没能得到上面所总结的结果。下面是测试SQL语句,先是外表为大表,内表为小表。(示例一)
SELECT count(driver_id) FROM driver where driver_id in (SELECT uid FROM user);
SELECT count(driver_id) FROM driver where exists (SELECT 1 FROM user where uid = driver.driver_id);
执行结果是:
再是外表是小表,内表是大表。(示例二)
select count(uid) from user where uid in (SELECT driver_id FROM driver);
select count(uid) from user where exists (SELECT 1 FROM driver where driver.driver_id = user.uid);
Convert the subquery to a join, or use table pullout and run the query as an inner join between subquery tables and outer tables. Table pullout pulls a table out from the subquery to the outer query.将子查询转变为一个连接,或是利用table pullout并将查询作为子查询表和外表之间的一个内连接来执行。Table pullout会为外部查询从子查询抽取出一个表。
有些时候,一个子查询可以被重写为JOIN,例如:
SELECT OrderID FROM Orders where EmployeeID IN (select EmployeeID from Employees where EmployeeID > 3);
如果知道OrderID是唯一的,即主键或者唯一索引,那么SQL语句会被重写为Join形式。
SELECT OrderID FROM Orders join Employees where Orders.EmployeeID = Employees.EmployeeID and Employees.EmployeeID > 3;
select `northwind`.`Orders`.`OrderID` AS `OrderID` from `northwind`.`Orders` where <in_optimizer>(`northwind`.`Orders`.`EmployeeID`,<exists>(<primary_index_lookup>(<cache>(`northwind`.`Orders`.`EmployeeID`) in Employees on PRIMARY where ((`northwind`.`Employees`.`EmployeeID` > 3) and (<cache>(`northwind`.`Orders`.`EmployeeID`) = `northwind`.`Employees`.`EmployeeID`)))))
正是上面说的in为什么慢?
在MySql 5.6中,优化器会对SQL语句重写,得到的执行计划:
在MySql 5.6中,优化器没有将独立子查询重写为相关子查询,通过EXPLAIN EXTENDED 和 SHOW WARNINGS命令,得到优化器的执行方式为:
/* select#1 */ select `northwind`.`orders`.`OrderID` AS `OrderID` from `northwind`.`employees` join `northwind`.`orders` where ((`northwind`.`orders`.`EmployeeID` = `northwind`.`employees`.`EmployeeID`) and (`northwind`.`employees`.`EmployeeID` > 3))
很显然,优化器将上述子查询重写为JOIN语句,这就是Table Pullout优化。
Duplicate Weedout优化
Run the semi-join as if it was a join and remove duplicate records using a temporary table.执行半连接,就如同它是一个连接并利用临时表移除了重复的记录。
上面内部表查出的列是唯一的,因此优化器会将子查询重写为JOIN语句,以提高SQL执行的效率。Duplicate Weedout优化是指外部查询条件是列是唯一的,MySql优化器会先将子查询查出的结果进行去重。比如下面这条SQL语句:
SELECT ContactName FROM Customers where CustomerID in (select CustomerID from Orders where OrderID > 10000 and Customers.Country = Orders.ShipCountry);
Extra选项提示的Start temporary表示创建一张去重的临时表,End temporary表示删除该临时表。而通过EXPLAIN EXTENDED 和 SHOW WARNINGS命令,得到优化器的执行方式为:
/* select#1 */ select `northwind`.`customers`.`ContactName` AS `ContactName` from `northwind`.`customers` semi join (`northwind`.`orders`) where ((`northwind`.`customers`.`CustomerID` = `northwind`.`orders`.`CustomerID`) and (`northwind`.`customers`.`Country` = `northwind`.`orders`.`ShipCountry`) and (`northwind`.`orders`.`OrderID` > 10000))
Materialize the subquery into a temporary table with an index and use the temporary table to perform a join. The index is used to remove duplicates. The index might also be used later for lookups when joining the temporary table with the outer tables; if not, the table is scanned.
FirstMacth优化
When scanning the inner tables for row combinations and there are multiple instances of a given value group, choose one rather than returning them all. This "shortcuts" scanning and eliminates production of unnecessary rows.为了对记录进行合并而在扫描内表,并且对于给定值群组有多个实例时,选择其一而不是将它们全部返回。这为表扫描提供了一个早期退出机制而且还消除了不必要记录的产生。
半连接的最先匹配(FirstMatch)策略执行子查询的方式与MySQL稍早版本中的IN-TO-EXISTS是非常相似的。对于外表中的每条匹配记录,MySQL都会在内表中进行匹配检查。当发现存在匹配时,它会从外表返回记录。只有在未发现匹配的情况下,引擎才会回退去扫描整个内表。
LooseScan优化
Scan a subquery table using an index that enables a single value to be chosen from each subquery's value group.利用索引来扫描一个子查询表可以从每个子查询的值群组中选出一个单一的值。
SEMI JOIN变量
Each of these strategies except Duplicate Weedout can be enabled or disabled using the optimizer_switch system variable. The semijoin flag controls whether semi-joins are used. If it is set to on, the firstmatch, loosescan, and materialization flags enable finer control over the permitted semi-join strategies. These flags are on by default.除Duplicate Weedout之外的每个策略可以用变量控制开关,semijoin控制semi-joins优化是否开启,如果设置开启,其他的策略也有独立的变量控制。所有的变量在5.6默认是打开的。
mysql> SELECT @@optimizer_switch\G;
*************************** 1. row ***************************
@@optimizer_switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on
1 row in set (0.00 sec)
EXPLAIN查看策略
Semi-joined tables show up in the outer select. EXPLAIN EXTENDED plus SHOW WARNINGS shows the rewritten query, which displays the semi-join structure. From this you can get an idea about which tables were pulled out of the semi-join. If a subquery was converted to a semi-join, you will see that the subquery predicate is gone and its tables and WHERE clause were merged into the outer query join list and WHERE clause.
Temporary table use for Duplicate Weedout is indicated by Start temporary and End temporary in the Extra column. Tables that were not pulled out and are in the range of EXPLAIN output rows covered by Start temporary and End temporary will have their rowid in the temporary table.
FirstMatch(tbl_name) in the Extra column(列) indicates join shortcutting.
LooseScan(m..n) in the Extra column indicates use of the LooseScan strategy. m and n are key part numbers.
As of MySQL 5.6.7, temporary table use for materialization is indicated by rows with a select_type value of MATERIALIZED and rows with a table value of .
Before MySQL 5.6.7, temporary table use for materialization is indicated in the Extra column by Materialize if a single table is used, or by Start materialize and End materialize if multiple tables are used. If Scan is present, no temporary table index is used for table reads. Otherwise, an index lookup is used.