写点什么

MySQL 运行时的可观测性

作者:GreatSQL
  • 2023-08-21
    福建
  • 本文字数:7539 字

    阅读完需:约 25 分钟

  • 1.说在前面的话

  • 2.安装 employees 测试库

  • 3.观测 SQL 运行状态

  • 3.1 观测 SQL 运行时的内存消耗

  • 3.2 观测 SQL 运行时的其他开销

  • 3.3 观测 SQL 运行进度


感知 SQL 运行时的状态

1. 说在前面的话

在 MySQL 里,一条 SQL 运行时产生多少磁盘 I/O,占用多少内存,是否有创建临时表,这些指标如果都能观测到,有助于更快发现 SQL 瓶颈,扑灭潜在隐患。


从 MySQL 5.7 版本开始,performance_schema就默认启用了,并且还增加了sys schema,到了 8.0 版本又进一步得到增强提升,在 SQL 运行时就能观察到很多有用的信息,实现一定程度的可观测性。


下面举例说明如何进行观测,以及主要观测哪些指标。

2. 安装 employees 测试库

安装 MySQL 官方提供的employees测试数据库,戳此链接(https://dev.mysql.com/doc/index-other.html)下载,解压缩后开始安装:


$ mysql -f < employees.sql;
INFOCREATING DATABASE STRUCTUREINFOstorage engine: InnoDBINFOLOADING departmentsINFOLOADING employeesINFOLOADING dept_empINFOLOADING dept_managerINFOLOADING titlesINFOLOADING salariesdata_load_time_diff00:00:37
复制代码


MySQL 还提供了相应的使用文档:https://dev.mysql.com/doc/employee/en/


本次测试采用 GreatSQL 8.0.32-24 版本,且运行在 MGR 环境中:


greatsql> \s...Server version:         8.0.32-24 GreatSQL, Release 24, Revision 3714067bc8c...
greatsql> select MEMBER_ID, MEMBER_ROLE, MEMBER_VERSION from performance_schema.replication_group_members;+--------------------------------------+-------------+----------------+| MEMBER_ID | MEMBER_ROLE | MEMBER_VERSION |+--------------------------------------+-------------+----------------+| 2adec6d2-febb-11ed-baca-d08e7908bcb1 | SECONDARY | 8.0.32 || 2f68fee2-febb-11ed-b51e-d08e7908bcb1 | ARBITRATOR | 8.0.32 || 5e34a5e2-feb6-11ed-b288-d08e7908bcb1 | PRIMARY | 8.0.32 |+--------------------------------------+-------------+----------------+
复制代码

3. 观测 SQL 运行状态

查看当前连接/会话的连接 ID、内部线程 ID:


greatsql> select processlist_id, thread_id from performance_schema.threads where processlist_id = connection_id();+----------------+-----------+| processlist_id | thread_id |+----------------+-----------+|            110 |       207 |+----------------+-----------+
复制代码


查询得到当前的连接 ID=110,内部线程 ID=207。


P.S,由于本文整理过程不是连续的,所以下面看到的 thread_id 值可能会有好几个,每次都不同。

3.1 观测 SQL 运行时的内存消耗

执行下面的 SQL,查询所有员工的薪资总额,按员工号分组,并按薪资总额倒序,取前 10 条记录:


greatsql> explain select emp_no, sum(salary) as total_salary from salaries group by emp_no order by total_salary desc limit 10\G*************************** 1. row ***************************           id: 1  select_type: SIMPLE        table: salaries   partitions: NULL         type: indexpossible_keys: PRIMARY          key: PRIMARY      key_len: 7          ref: NULL         rows: 2838426     filtered: 100.00        Extra: Using temporary; Using filesort
复制代码


看到需要全索引扫描(其实也等同于全表扫描,因为是基于 PRIMARY 索引),并且还需要生成临时表,以及额外的 filesort。


在正式运行该 SQL 之前,在另外的窗口中新建一个连接会话,执行下面的 SQL 先观察该连接/会话当前的内存分配情况:


greatsql> select * from sys.x$memory_by_thread_by_current_bytes where thread_id = 207\G*************************** 1. row ***************************         thread_id: 207              user: root@localhostcurrent_count_used: 9 current_allocated: 26266 current_avg_alloc: 2918.4444 current_max_alloc: 16464   total_allocated: 30311
复制代码


等到该 SQL 执行完了,再一次查询内存分配情况:


greatsql> select * from sys.x$memory_by_thread_by_current_bytes where thread_id = 207\G*************************** 1. row ***************************         thread_id: 207              user: root@localhostcurrent_count_used: 13 current_allocated: 24430 current_avg_alloc: 1879.2308 current_max_alloc: 16456   total_allocated: 95719
复制代码


我们注意到几个数据的变化情况,用下面表格来展示:



也就是说,SQL 运行时,需要分配的内存是:95719 - 30311 = 65408 字节。

3.2 观测 SQL 运行时的其他开销

通过观察 performance_schema.status_by_thread 表,可以知道相应连接/会话中 SQL 运行的一些状态指标。在 SQL 运行结束后,执行下面的 SQL 命令即可查看:


greatsql> select * from performance_schema.status_by_thread where thread_id = 207;...|       207 | Created_tmp_disk_tables             | 0                        ||       207 | Created_tmp_tables                  | 0                        |...|       207 | Handler_read_first                  | 1                        ||       207 | Handler_read_key                    | 1                        ||       207 | Handler_read_last                   | 0                        ||       207 | Handler_read_next                   | 2844047                  ||       207 | Handler_read_prev                   | 0                        ||       207 | Handler_read_rnd                    | 0                        ||       207 | Handler_read_rnd_next               | 0                        ||       207 | Handler_rollback                    | 0                        ||       207 | Handler_savepoint                   | 0                        ||       207 | Handler_savepoint_rollback          | 0                        ||       207 | Handler_update                      | 0                        ||       207 | Handler_write                       | 0                        ||       207 | Last_query_cost                     | 286802.914893            ||       207 | Last_query_partial_plans            | 1                        |...|       207 | Select_full_join                    | 0                        ||       207 | Select_full_range_join              | 0                        ||       207 | Select_range                        | 0                        ||       207 | Select_range_check                  | 0                        ||       207 | Select_scan                         | 1                        ||       207 | Slow_launch_threads                 | 0                        ||       207 | Slow_queries                        | 1                        ||       207 | Sort_merge_passes                   | 0                        ||       207 | Sort_range                          | 0                        ||       207 | Sort_rows                           | 1                       ||       207 | Sort_scan                           | 1                        |...
复制代码


上面我们只罗列了部分比较重要的状态指标。从这个结果也可以佐证 slow query log 中的结果,确实没创建临时表。


作为参照,查看这条 SQL 对应的 slow query log 记录:


# Query_time: 0.585593  Lock_time: 0.000002 Rows_sent: 10  Rows_examined: 2844057 Thread_id: 110 Errno: 0 Killed: 0 Bytes_received: 115 Bytes_sent: 313 Read_first: 1 Read_last: 0 Read_key: 1 Read_next: 2844047 Read_prev: 0 Read_rnd: 0 Read_rnd_next: 0 Sort_merge_passes: 0 Sort_range_count: 0 Sort_rows: 10 Sort_scan_count: 1 Created_tmp_disk_tables: 0 Created_tmp_tables: 0 Start: 2023-07-06T10:06:01.438376+08:00 End: 2023-07-06T10:06:02.023969+08:00 Schema: employees Rows_affected: 0# Tmp_tables: 0  Tmp_disk_tables: 0  Tmp_table_sizes: 0# InnoDB_trx_id: 0# Full_scan: Yes  Full_join: No  Tmp_table: No  Tmp_table_on_disk: No# Filesort: Yes  Filesort_on_disk: No  Merge_passes: 0#   InnoDB_IO_r_ops: 0  InnoDB_IO_r_bytes: 0  InnoDB_IO_r_wait: 0.000000#   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000#   InnoDB_pages_distinct: 4281use employees;SET timestamp=1688609161;select emp_no, sum(salary) as total_salary from salaries group by emp_no order by total_salary desc limit 10;
复制代码


可以看到,Created_tmp_disk_tables, Created_tmp_tables, Handler_read_next, Select_full_join, Select_scan, Sort_rows, Sort_scan, 等几个指标的数值是一样的。


还可以查看该 SQL 运行时的 I/O latency 情况,SQL 运行前后两次查询对比:


greatsql> select * from sys.io_by_thread_by_latency where thread_id = 207;+----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+| user           | total | total_latency | min_latency | avg_latency | max_latency | thread_id | processlist_id |+----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+| root@localhost |     7 | 75.39 us      | 5.84 us     | 10.77 us    | 22.12 us    |       207 |            110 |+----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+
...
greatsql> select * from sys.io_by_thread_by_latency where thread_id = 207;+----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+| user | total | total_latency | min_latency | avg_latency | max_latency | thread_id | processlist_id |+----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+| root@localhost | 8 | 85.29 us | 5.84 us | 10.66 us | 22.12 us | 207 | 110 |+----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+
复制代码


可以看到这个 SQL 运行时的 I/O latency 是:85.29 - 75.39 = 9.9us。

3.3 观测 SQL 运行进度

我们知道,运行完一条 SQL 后,可以利用 PROFLING 功能查看它各个阶段的耗时,但是在运行时如果也想查看各阶段耗时该怎么办呢?


从 MySQL 5.7 版本开始,可以通过 performance_schema.events_stages_% 相关表查看 SQL 运行过程以及各阶段耗时,需要先修改相关设置:


# 确认是否对所有主机&用户都启用greatsql> SELECT * FROM performance_schema.setup_actors;+------+------+------+---------+---------+| HOST | USER | ROLE | ENABLED | HISTORY |+------+------+------+---------+---------+| %    | %    | %    | NO      | NO      |+------+------+------+---------+---------+
# 修改成对所有主机&用户都启用greatsql> UPDATE performance_schema.setup_actors SET ENABLED = 'YES', HISTORY = 'YES' WHERE HOST = '%' AND USER = '%'; # 修改 setup_instruments & setup_consumers 设置greatsql> UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE '%events_statements_%'; greatsql> UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE '%events_stages_%';
复制代码


这就实时可以观测 SQL 运行过程中的状态了。


在 SQL 运行过程中,从另外的窗口查看该 SQL 对应的 EVENT_ID


greatsql> SELECT EVENT_ID, TRUNCATE(TIMER_WAIT/1000000000000,6) as Duration, SQL_TEXT        FROM performance_schema.events_statements_history WHERE thread_id = 85 order by event_id desc limit 5;+----------+----------+-------------------------------------------------------------------------------------------------------------------------------+| EVENT_ID | Duration | SQL_TEXT                                                                                                                      |+----------+----------+-------------------------------------------------------------------------------------------------------------------------------+|   149845 |   0.6420 | select emp_no, sum(salary) as total_salary, sleep(0.000001) from salaries group by emp_no order by total_salary desc limit 10 ||   149803 |   0.6316 | select emp_no, sum(salary) as total_salary, sleep(0.000001) from salaries group by emp_no order by total_salary desc limit 10 ||   149782 |   0.6245 | select emp_no, sum(salary) as total_salary, sleep(0.000001) from salaries group by emp_no order by total_salary desc limit 10 ||   149761 |   0.6361 | select emp_no, sum(salary) as total_salary, sleep(0.000001) from salaries group by emp_no order by total_salary desc limit 10 ||   149740 |   0.6245 | select emp_no, sum(salary) as total_salary, sleep(0.000001) from salaries group by emp_no order by total_salary desc limit 10 |+----------+----------+-------------------------------------------------------------------------------------------------------------------------------+
# 再根据 EVENT_ID 值去查询 events_stages_history_longgreatsql> SELECT thread_id ,event_Id, event_name AS Stage, TRUNCATE(TIMER_WAIT/1000000000000,6) AS Duration FROM performance_schema.events_stages_history_long WHERE NESTING_EVENT_ID = 149845 order by event_id;+-----------+----------+------------------------------------------------+----------+| thread_id | event_Id | Stage | Duration |+-----------+----------+------------------------------------------------+----------+| 85 | 149846 | stage/sql/starting | 0.0000 || 85 | 149847 | stage/sql/Executing hook on transaction begin. | 0.0000 || 85 | 149848 | stage/sql/starting | 0.0000 || 85 | 149849 | stage/sql/checking permissions | 0.0000 || 85 | 149850 | stage/sql/Opening tables | 0.0000 || 85 | 149851 | stage/sql/init | 0.0000 || 85 | 149852 | stage/sql/System lock | 0.0000 || 85 | 149854 | stage/sql/optimizing | 0.0000 || 85 | 149855 | stage/sql/statistics | 0.0000 || 85 | 149856 | stage/sql/preparing | 0.0000 || 85 | 149857 | stage/sql/Creating tmp table | 0.0000 || 85 | 149858 | stage/sql/executing | 0.6257 || 85 | 149859 | stage/sql/end | 0.0000 || 85 | 149860 | stage/sql/query end | 0.0000 || 85 | 149861 | stage/sql/waiting for handler commit | 0.0000 || 85 | 149862 | stage/sql/closing tables | 0.0000 || 85 | 149863 | stage/sql/freeing items | 0.0000 || 85 | 149864 | stage/sql/logging slow query | 0.0000 || 85 | 149865 | stage/sql/cleaning up | 0.0000 |+-----------+----------+------------------------------------------------+----------+
复制代码


上面就是这条 SQL 的运行进度展示,以及各个阶段的耗时,和PROFILING的输出一样,当我们了解一条 SQL 运行所需要经历的各个阶段时,从上面的输出结果中也就能估算出该 SQL 大概还要多久能跑完,决定是否要提前 kill 它。


如果想要观察 DDL SQL 的运行进度,可以参考这篇文章:不用MariaDB/Percona也能查看DDL的进度


更多的观测指标、维度还有待继续挖掘,以后有机会再写。


另外,也可以利用 MySQL Workbench 工具,或 MySQL Enterprise Monitor,都已集成了很多可观测性指标,相当不错的体验。


延伸阅读


  • Query Profiling Using Performance Schema, https://dev.mysql.com/doc/refman/8.0/en/performance-schema-query-profiling.html

  • 不用 MariaDB/Percona 也能查看 DDL 的进度,https://mp.weixin.qq.com/s?__biz=MjM5NzAzMTY4NQ==&mid=2653931466&idx=1&sn=8c14c7a6449205b61990a567a1be315e&scene=21#wechat_redirect

  • 事件记录 | performance_schema 全方位介绍,http://mp.weixin.qq.com/s?__biz=MjM5NzAzMTY4NQ==&mid=2653935075&idx=3&sn=85be3a5bbe6be8bacd2a88e22ab95b5c&chksm=bd3b4f898a4cc69fb303f977912f4ee4a399ab96b72b8d994e262b1095edad4c7e2e6f45dc9b&scene=21#wechat_redirect

  • 内存分配统计视图 | 全方位认识 sys 系统库,http://mp.weixin.qq.com/s?__biz=MjM5NzAzMTY4NQ==&mid=2653935156&idx=6&sn=3061ca49bdefd61eca1fe9b3c52097a1&chksm=bd3b4c5e8a4cc5480e6c27dc42a68837a113abea4279b4ae492078f263cb65421c99dc39d978&scene=21#wechat_redirect

发布于: 2023-08-21阅读数: 13
用户头像

GreatSQL

关注

GreatSQL社区 2023-01-31 加入

GreatSQL是由万里数据库维护的MySQL分支,专注于提升MGR可靠性及性能,支持InnoDB并行查询特性,是适用于金融级应用的MySQL分支版本。 社区:https://greatsql.cn/ Gitee: https://gitee.com/GreatSQL/GreatSQL

评论

发布
暂无评论
MySQL运行时的可观测性_数据库_GreatSQL_InfoQ写作社区