处理mysql复制故障一例
今天一个主从复制的服务器出现了问题,数据和主服务器不一样,但是复制进程是对的。开始我只看了复制进程,没有太注意,原始内容如下:mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.125.9
Master_User: slave
Master_Port: 9188
Connect_Retry: 10
Master_Log_File: mysql-bin.000004
Read_Master_Log_Pos: 805036116
Relay_Log_File: localhost-relay-bin.000018
Relay_Log_Pos: 192473825
Relay_Master_Log_File: mysql-bin.000004
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 197587108
Relay_Log_Space: 805036869
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 25236
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
1 row in set (0.00 sec) 后来发现数据不对,发现
Read_Master_Log_Pos和Exec_Master_Log_Pos真么差别这么大,原理上,应该是挨着的, 然后开始找问题,把日志翻来覆去的看,没有发现任何问题,然后看了一下processlist: ,发现延迟了5个多小时,我的个娘耶。
mysql> show processlist\G
*************************** 1. row ***************************
Id: 1
User: root
Host: localhost
db: NULL
Command: Query
Time: 0
State: NULL
Info: show processlist
*************************** 2. row ***************************
Id: 2
User: system user
Host:
db: NULL
Command: Connect
Time: 1038
State: Waiting for master to send event
Info: NULL
*************************** 3. row ***************************
Id: 3
User: system user
Host:
db: NULL
Command: Connect
Time: 18697
State: freeing items
Info: NULL
3 rows in set (0.00 sec) 基本确认主库和从库是延迟,而不是认为或者其他bug之类的东西。
开始找延迟的原因吧:
1、检查网络延时:
# ping 192.168.125.9
PING 192.168.125.9 (192.168.125.9) 56(84) bytes of data.
64 bytes from 192.168.125.9: icmp_seq=1 ttl=64 time=0.555 ms
64 bytes from 192.168.125.9: icmp_seq=2 ttl=64 time=0.588 ms
64 bytes from 192.168.125.9: icmp_seq=3 ttl=64 time=0.635 ms
64 bytes from 192.168.125.9: icmp_seq=4 ttl=64 time=0.588 ms
64 bytes from 192.168.125.9: icmp_seq=5 ttl=64 time=0.586 ms 没有发现问题。
2、检查系统资源
# iostat -dx 1 5
Linux 2.6.18-164.el5PAE (localhost.localdomain) 11/18/2011
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz awaitsvctm%util
sda 0.07 40.800.22 123.53 6.061314.72 10.67 0.98 7.92 2.7433.97
sda1 0.01 0.000.000.00 0.02 0.00 18.93 0.00 2.33 1.60 0.00
sda2 0.00 0.000.000.00 0.01 0.00 36.13 0.00 3.65 2.81 0.00
sda3 0.02 0.010.000.01 0.02 0.16 20.21 0.00 5.36 2.71 0.00
sda4 0.00 0.000.000.00 0.00 0.00 2.00 0.00 21.0021.00 0.00
sda5 0.04 40.790.22 123.52 6.011314.56 10.67 0.98 7.92 2.7433.96
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz awaitsvctm%util
sda 0.00 156.002.00 365.00 16.004160.00 11.38 3.15 8.64 2.6898.30
sda1 0.00 0.000.000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda2 0.00 0.000.000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda3 0.00 0.000.000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda4 0.00 0.000.000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda5 0.00 156.002.00 365.00 16.004160.00 11.38 3.15 8.64 2.6898.30
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz awaitsvctm%util
sda 0.00 112.000.00 315.00 0.003448.00 10.95 2.52 8.06 3.1398.70
sda1 0.00 0.000.000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda2 0.00 0.000.000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda3 0.00 0.000.000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda4 0.00 0.000.000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda5 0.00 112.000.00 315.00 0.003448.00 10.95 2.52 8.06 3.1398.70
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz awaitsvctm%util
sda 0.00 103.960.00 195.05 0.002392.08 12.26 3.12 15.47 5.0698.71
sda1 0.00 0.000.000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda2 0.00 0.000.000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda3 0.00 0.000.000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda4 0.00 0.000.000.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda5 0.00 103.960.00 195.05 0.002392.08 12.26 3.12 15.47 5.0698.71 发现有IO,但是不太厉害,算正常。
然后看CPU,使用top命令
Tasks: 141 total, 1 running, 139 sleeping, 1 stopped, 0 zombie
Cpu0:0.0%us,0.0%sy,0.0%ni,100.0%id,0.0%wa,0.0%hi,0.0%si,0.0%st
Cpu1:0.3%us,0.3%sy,0.0%ni, 99.3%id,0.0%wa,0.0%hi,0.0%si,0.0%st
Cpu2:0.0%us,0.0%sy,0.0%ni,100.0%id,0.0%wa,0.0%hi,0.0%si,0.0%st
Cpu3:0.0%us,4.7%sy,0.0%ni, 94.7%id,0.0%wa,0.0%hi,0.7%si,0.0%st
Cpu4:0.0%us,1.3%sy,0.0%ni,1.0%id, 97.3%wa,0.3%hi,0.0%si,0.0%st
Cpu5:0.0%us,0.3%sy,0.0%ni, 99.7%id,0.0%wa,0.0%hi,0.0%si,0.0%st
Cpu6:0.0%us,0.0%sy,0.0%ni, 91.3%id,0.0%wa,3.7%hi,5.0%si,0.0%st
Cpu7:0.3%us,7.0%sy,0.0%ni, 81.8%id,0.0%wa,1.0%hi,9.9%si,0.0%st
Mem: 4139196k total,2605768k used,1533428k free, 215052k buffers
Swap:8385920k total, 0k used,8385920k free,2143992k cached 发现
Cpu4:0.0%us,1.3%sy,0.0%ni,1.0%id, 97.3%wa,0.3%hi,0.0%si,0.0%st这个核有点问题 wait太高的,话,应该是cpu在等待IO资源,导致的。 但是IO资源又不太高,才几M的写入。 不过现在也没办法,只有想法降低mysql的写入来试试看。我关闭了slave更新bin-log的功能 #log-slave-updates然后重启mysql,启动slave进程。现在Slave_SQL进程速度加快了,明显看见 Exec_Master_Log_Pos数增加,show processlist中的时间也在缩短, 过了有20多分钟,数据同步完成了。通过maatkit工具检查,没有问题。故障现在是修复了,不过为什么mysql的IO那么低,却会导致IO资源缺乏,我尝试通过写文件来测试磁盘IO,应该可以达到130多M/s。 这个问题现在比较诡异了,估计是mysql slave只能使用单核导致的,而这台服务器虽然是8核的,但是单核的主频只有2.13,中继sql,执行sql,写入自己的blog,IO请求都在一个核上,导致的资源不足。
在补充一下,这个机器业务空闲后,我做了一个基准测试,发现效率还是上不去。找了硬件的原因,也怀疑过编译参数,在相同硬件平台和mysql版本做了相关测试。最后发现这次性能问题的最终杀手是my.cnf的一个配置。这次心血来潮加了如下一个参数
sync_binlog=1//意思是即时同步binlog到硬盘上,不缓存日志,目的是避免硬件故障或者软件故障导致binlog没有即时写盘而丢失。但是开始没想到这个参数在QPS上到100左右后,性能消耗是如此的高,导致整个CPU核都在等待IO,建议大家以后不是特别需要,还是别碰这个参数了,默认是0,由操作系统来调度什么时候写入到硬盘,或者将值调整到比较大。
还有一个就是innodb引擎的
innodb_flush_log_at_trx_commit = N
N=0– 每隔一秒,把事务日志缓存区的数据写到日志文件中,以及把日志文件的数据刷新到磁盘上;
N=1– 每个事务提交时候,把事务日志从缓存区写到日志文件中,并且刷新日志文件的数据到磁盘上;
N=2– 每事务提交的时候,把事务日志数据从缓存区写到日志文件中;每隔一秒,刷新一次日志文件,但不一定刷新到磁盘上,而是取决于操作系统的调度;
如果设置为1,效率也会相当的低,建议设置到2,如果想要性能再提交,可以设置为0.
具体可以参见如下文章:http://www.mysqlops.com/2011/10/26/mysql-variable-third.html
页:
[1]