极客时间-轻松学习，高效学习-极客邦

叶明
2024-10-11 来自江苏
老师，这里要考虑 <sort_key, additional_fields> 吗？因为我看我的语句没你那么慢，1s左右就返回了，你的查询要1分钟多，让我怀疑是不是没走索引查找或者扫描了 c 这个大列。
作者回复: 执行1分钟主要是因为我的测试虚机IO性能太差了。执行计划没有问题。 mysql> explain select count(*) from t_jointab t1, t_jointab t2 where t1.a = t2.a and t1.b = t2.b; +----+-------------+-------+------+---------------+-------+---------+----------+------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | +----+-------------+-------+------+---------------+-------+---------+----------+------+ | 1 | SIMPLE | t1 | ALL | idx_a | NULL | NULL | NULL | 8580 | | 1 | SIMPLE | t2 | ref | idx_a | idx_a | 4 | rep.t1.a | 1 | ### 虚机上执行 1分多 mysql> select count(*) from t_jointab t1, t_jointab t2 where t1.a = t2.a and t1.b = t2.b; +----------+ | count(*) | +----------+ | 10000 | +----------+ 1 row in set (1 min 30.77 sec) 换到笔记本上执行也是1秒多。 ### 笔记本上的执行时间 1.14秒 mysql> select count(*) from t_jointab t1 , t_jointab t2 where t1.a = t2.a and t1.b = t2.b; +----------+ | count(*) | +----------+ | 10000 | +----------+ 1 row in set (1.14 sec)
叶明
2024-10-11 来自江苏
表 t_jointab 中有 1 万条记录，前 9000 条记录的字段 a 是唯一的，后 1000 条记录的字段 a 的值都是 9000 select count(*) from t_jointab t1, t_jointab t2 where t1.a = t2.a and t1.b = t2.b; 1. 先对表 t1 进行全表扫描 2. 拿步骤1 中得到的 a 值再到表 t2 的索引 idx_a 中去检索前 9000 条是唯一的，因此扫描行数 = 9000 + 9000*1 后 1000 条记录的 a 都相同，因此扫描行数= 1000 + 1000 * 1000 = 10100 总行数=18000 + 10100 = 1019000 优化方案1：哈希连接，只用扫描 20000 行 mysql> select count(*) from t_jointab t1 ignore index(idx_a), t_jointab t2 ignore index(idx_a) where t1.a = t2.a and t1.b = t2.b; 1 row in set (0.01 sec) 优化方案2：修改索引 idx_a，在字段 a,b 上建立一个联合索引，拿字段 a 和 b 去被驱动表索引查找，避免扫描额外数据以及避免回表和数据过滤 alter table t_jointab drop index idx_a, add index idx_a(a,b); mysql> select count(*) from t_jointab t1, t_jointab t2 where t1.a = t2.a and t1.b = t2.b; 1 row in set (0.03 sec)
展开
作者回复: 考虑得很全面了。这个场景下使用hash join，或者将索引idx_a(a)改成组合索引idx_a(a,b) 都能取得不错的效果。👍👍 虽然这里的例子是人为构造的，但是正式环境中有时也会遇到类似的问题。有时也把这种情况称为“数据倾斜”。
TheOne
2024-10-10 来自北京
思考题，有点想不明白
作者回复: 先看下SQL的执行计划，是一个嵌套循环。 mysql> explain format=tree select count(*) from t_jointab t1 , t_jointab t2 where t1.a = t2.a and t1.b = t2.b\G *************************** 1. row *************************** EXPLAIN: -> Aggregate: count(0) (cost=4307.55 rows=1) -> Nested loop inner join (cost=4221.75 rows=858) -> Index scan on t1 using idx_ab (cost=1218.75 rows=8580) -> Filter: (t2.b = t1.b) (cost=0.25 rows=0.1) -> Index lookup on t2 using idx_a (a=t1.a) (cost=0.25 rows=1) 这个执行计划，用伪代表示，大概是这样的： def get_count(): cnt = 0 for t1_row in (select * from t_jointab t1): for t2_row in (select * from t_jointab t2 where a = t1_row.a and b = t1_row.b): cnt = cnt + 1 return cnt 由于t1.a = 9000的时候，使用索引idx_a(a)能匹配到1000行数据，而且有1000条t1.a = 9000的数据，因此循环的执行次数比较多。叶明同学的留言中，给出了解决这个问题的两种有效的方法。
共 2 条评论