Atlas使用问题汇总

By | 2014/06/04

1.安装问题
源码编译Atlas新版
注:2.1版的编码问题解决了, /master/方式通过php,java等完成访问(命令行方式不能转到master主机上).

yum install -y libevent-devel lua-devel openssl-devel flex mysql-devel
yum install gcc.x86_64 gcc-c++.x86_64

glibc install:
  391  wget http://pkgs.fedoraproject.org/repo/pkgs/glib2/glib-2.32.4.tar.xz/bf84fefd9c1a5b5a7a38736f4ddd674a/glib-2.32.4.tar.xz
  393  tar xf glib-2.32.4.tar.xz 
  395  cd glib-2.32.4
  397  mkdir /usr/local/glib/
  398  ./configure --prefix=/usr/local/glib
  401  yum install libffi-devel.x86_64
  403  ./configure --prefix=/usr/local/glib
  405  make
  406  make install 

checking for GLIB... configure: error: Package requirements (glib-2.0 >= 2.32.0) were not met:
checking for GMODULE... configure: error: Package requirements (gmodule-2.0 >= 2.32.0) were not met:

[root@atlas Atlas-master]# cat bootstrap.sh

#!/bin/sh 
base=$(cd "$(dirname "$0")"; pwd)
cd $base
PKG_CONFIG_PATH=/usr/local/glib/lib/pkgconfig ./configure --with-mysql=/usr --prefix=/opt/mysql-proxy 

#make && make install

2.延迟处理
见: https://github.com/Qihoo360/Atlas/issues/8#issuecomment-38153303
使用中间件做read负载, 最核心的是解决延迟问题,应用端可以通过/master/方式强制指定从master获取信息,不过这不能从本质上解决问题,使用Atlas需要考虑到这点, 提高主机的IO性能,加大memory可以缓解延迟症状,但依旧不能避免延迟的出现,尤其是读多写少的应用。

Atlas没有做主从状态的检测, 在SQL_THREAD为No情况下,sql的执行失败(在延迟情况下不管是空结果集还是表不存在)都不会使slave在backends下线,不过可以通过管理接口来实现offline出问题的slave,添加命令到监控项可实现自动下线。比如:

mysql -h 10.3.254.106 -P 2345 -u admin -pxxxxxxxx -e 'SET OFFLINE 3'

下线功能或许可能在未来的版本实现,目前可以通过daemon程序来频繁检测slave的延迟情况, 延迟较大(如检测两次,每次都大于10s)则下线slave。

https://github.com/arstercz/Atlas_auto_setline

# atlas auto set line.
* * * * * cd /web/scripts/atlas_auto; perl atlas_auto_setline.pl --conf=db.conf --verbose --setline >>setline.log 2>&1

3.thread超时退出
MySQL为percona 5.1.34版本

mysql root@[localhost:s3306 information_schema] > select command,time,state,info from processlist order by time desc limit 10;
+---------+--------+----------------------------------+------+
| command | time   | state                            | info |
+---------+--------+----------------------------------+------+
| Connect | 111463 | Waiting for master to send event | NULL |
| Sleep   |   1473 |                                  | NULL |
| Sleep   |   1473 |                                  | NULL |
| Sleep   |   1356 |                                  | NULL |
| Sleep   |   1356 |                                  | NULL |
| Sleep   |   1342 |                                  | NULL |
| Sleep   |   1242 |                                  | NULL |
| Sleep   |   1242 |                                  | NULL |
| Sleep   |   1241 |                                  | NULL |
| Sleep   |   1241 |                                  | NULL |
+---------+--------+----------------------------------+------+
10 rows in set (0.00 sec)

mysql root@[localhost:s3306 information_schema] > select count(*) from processlist;                                           
+----------+
| count(*) |
+----------+
|       76 |
+----------+
1 row in set (0.00 sec)

mysql root@[localhost:s3306 information_schema] > select command,time,state,info from processlist order by time desc limit 10;
+---------+--------+----------------------------------+------+
| command | time   | state                            | info |
+---------+--------+----------------------------------+------+
| Connect | 111490 | Waiting for master to send event | NULL |
| Sleep   |   1383 |                                  | NULL |
| Sleep   |   1383 |                                  | NULL |
| Sleep   |   1369 |                                  | NULL |
| Sleep   |   1269 |                                  | NULL |
| Sleep   |   1269 |                                  | NULL |
| Sleep   |   1268 |                                  | NULL |
| Sleep   |   1268 |                                  | NULL |
| Sleep   |   1268 |                                  | NULL |
| Sleep   |   1252 |                                  | NULL |
+---------+--------+----------------------------------+------+
10 rows in set (0.00 sec)

mysql root@[localhost:s3306 information_schema] > select count(*) from processlist;                                           
+----------+
| count(*) |
+----------+
|       74 |
+----------+
1 row in set (0.00 sec)

为interactive_timeout参数的值,见:

| interactive_timeout        | 1500  |
| wait_timeout               | 600   |
On thread startup, the session wait_timeout value is initialized from the global wait_timeout value or from the global interactive_timeout value, depending on the type of client (as defined by the CLIENT_INTERACTIVE connect option to mysql_real_connect()). See also interactive_timeout.

即Mysql 初始化wait_timeout的值根据CLIENT_INTERACTIVE(应用端交互类型)选取是wait_timeout还是interactive_timeout。 php应用采用mysql_real_connect函数连接(不一定严格使用该名字的函数,只要底层使用了CLIENT_INTERACTIVE类型;比如mysqli驱动连接分为mysqli::real_connect — mysqli_real_connect — Opens a connection to a mysql server和mysqli::__construct — mysqli_connect — Open a new connection to the MySQL server两种,前者连接前需要初始化一个对象,后者为单纯的新建连接,这个时候的初始值为wait_timeout),则选取interactive_timeout值作为初始值见(interactive_timeout选项): An interactive client is defined as a client that uses the CLIENT_INTERACTIVE option to mysql_real_connect()。

4.resource limit问题
atlas有自己的连接池,会吃掉很多CPU, php应用端改用短链接来连接atlas, 这时候atlas对php发送来的sql只负责验证和转发的操作,后端DB的连接由atlas自己管理,未使用的连接线程进行剔除操作(DB的wait_timeout和interactive_timeout设置为300s,超时亦退出)。

2014-04-12 20:56:29: (warning) (libevent) event_del: event has no event_base set.
2014-04-12 20:56:29: (critical) last message repeated 5 times
2014-04-12 20:56:29: (critical) network-conn-pool-lua.c.144: socket() failed: Too many open files (24)
2014-04-12 20:56:29: (warning) (libevent) event_del: event has no event_base set.
2014-04-12 20:56:30: (debug) chassis-unix-daemon.c:168: 12951 returned: 12951
2014-04-12 20:56:30: (critical) chassis-unix-daemon.c:196: [angel] PID=12951 died on signal=11 (it used 16 kBytes max) ... waiting 3min before restart
如果MySQL后端的连接数也满了, 或者atlas 连不上后端的数据库都可能会报以下错误:
2014-11-13 12:21:07: (critical) network_mysqld_proto_password_scramble: assertion `20 == challenge_len' failed
2014-11-13 12:21:07: (warning) (libevent) event_del: event has no event_base set.
2014-11-13 12:21:07: (critical) 

可以临时增加MySQL connection数量:
echo -n “Max processes=SOFT_LIMIT:HARD_LIMIT” > /proc/pidof mysqld/limits

关于Too many open files错误,可能由两种情况引起:
一、php长连接连接到atlas后,每个线程占用一个FD,直到超出系统资源限制而出现too many错误;
二、php应用端发送到atlas的sql过多,大量并发的情况下,linevent维护的队列过多,每个event吃一个FD,超出系统资源限制引起too many错误;

避免too many错误,增加用户的ulimit值加大FD的使用量,可增加系统ulimit 资源到 ~/.bash_profile文件或/etc/security/limits.conf文件:

# cat .bash_profile 
# .bash_profile
...
...
export PATH
ulimit -n 16384

5. OOM问题
mysql-proxy进程占用系统资源过多, 由于系统评分机制的原因, atlas进程被系统杀掉, 比如以下错误:

May  3 09:55:31 cz kernel: Out of memory: Kill process 45330 (mysql-proxy) score 1 or sacrifice child
May  3 09:55:31 cz kernel: Killed process 45331, UID 0, (mysql-proxy) total-vm:4917192kB, anon-rss:177720kB, file-rss:40kB
May  3 09:55:31 cz kernel: mysql-proxy: page allocation failure. order:0, mode:0x201da
May  3 09:55:31 cz kernel: Pid: 45331, comm: mysql-proxy Not tainted 2.6.32-431.5.1.el6.x86_64 #1
May  3 09:55:31 cz kernel: Call Trace:
May  3 09:55:31 cz kernel: [] ? __alloc_pages_nodemask+0x757/0x8d0
May  3 09:55:31 cz kernel: [] ? alloc_pages_current+0xaa/0x110
May  3 09:55:31 cz kernel: [] ? __page_cache_alloc+0x87/0x90
May  3 09:55:31 cz kernel: [] ? find_get_page+0x1e/0xa0
May  3 09:55:31 cz kernel: [] ? filemap_fault+0x1a7/0x500
May  3 09:55:31 cz kernel: [] ? __do_fault+0x54/0x530
May  3 09:55:31 cz kernel: [] ? handle_pte_fault+0xf7/0xb00
May  3 09:55:31 cz kernel: [] ? thread_return+0x4e/0x76e
May  3 09:55:31 cz kernel: [] ? __remove_hrtimer+0x3b/0xb0
May  3 09:55:31 cz kernel: [] ? lock_hrtimer_base+0x31/0x60
May  3 09:55:31 cz kernel: [] ? hrtimer_try_to_cancel+0x3f/0xd0
May  3 09:55:31 cz kernel: [] ? handle_mm_fault+0x22a/0x300
May  3 09:55:31 cz kernel: [] ? __do_page_fault+0x138/0x480
May  3 09:55:31 cz kernel: [] ? ep_poll+0x12e/0x330
May  3 09:55:31 cz kernel: [] ? default_wake_function+0x0/0x20
May  3 09:55:31 cz kernel: [] ? do_page_fault+0x3e/0xa0
May  3 09:55:31 cz kernel: [] ? page_fault+0x25/0x30

鉴于这种情况, 即便atlas做了高可用相关的架构, 在切换到另一台atlas之后也可能很快的被系统 OOM 掉, 可以临时禁止atlas进程参与系统评分, 尽量避免被系统杀掉.

for x in `pidof mysql-proxy`; do  echo '-17' > /proc/$x/oom_adj; done

9 thoughts on “Atlas使用问题汇总

  1. 苏幕遮

    博主,你好,我遇到你文中说到的几个问题,但是没有得到解决,前来打扰您。

    公司使用的阿里云的数据库,然后搭配自己搭建的atlas,现在情况是atlas和数据库的链接是没问题的。但是在阿里云的监控端发现读库的“当前连接数”始终是0,而主库是有数值的。打开atals查看日志就看到这样的东西:

    2017-03-28 10:41:04: (message) chassis-unix-daemon.c:138: [angel] we try to keep PID=31136 alive
    2017-03-28 10:41:04: (message) mysql-proxy 0.8.2 started – instance: openhls
    2017-03-28 10:41:04: (message) proxy listening on port 0.0.0.0:3307
    2017-03-28 10:41:04: (message) added read/write backend: rm-bp140cd7xf48273m8.mysql.rds.aliyuncs.com:3306
    2017-03-28 10:41:04: (message) added read-only backend: rr-bp1x2v62r2558089s.mysql.rds.aliyuncs.com:3306
    2017-03-28 10:41:04: (message) chassis-event-thread.c:235: starting 8 threads
    2017-03-28 10:41:05: (warning) (libevent) event_del: event has no event_base set.
    2017-03-28 10:41:36: (warning) last message repeated 64 times
    2017-03-28 10:41:36: (warning) (libevent) event_del: event has no event_base set.
    2017-03-28 10:42:07: (warning) last message repeated 61 times
    2017-03-28 10:42:07: (warning) (libevent) event_del: event has no event_base set.
    2017-03-28 10:42:38: (warning) last message repeated 61 times
    2017-03-28 10:42:38: (warning) (libevent) event_del: event has no event_base set.
    2017-03-28 10:43:09: (warning) last message repeated 69 times
     

    看到已经发现了两个数据库,但是却不断地在刷(warning) (libevent) event_del: event has no event_base set.我把读库和主库的wait_timeout和interactive_timeout设置为300s(默认值是86400和7200),但是效果没变,请问怎么破?

    Reply
    1. admin Post author

      后端的连接数满了吗? 参考文章的 resource limit 部分.

      Reply
      1. 苏幕遮

        阿里云2G的mysql 5.6最大连接数600(这个不能改,只能通过提升配置来改变),但是我们的连接数远远不足,还不到40…

        您这个只能传网络图片,无法上传本地图片…

        Reply
        1. admin Post author

          atlas 的配置有吗? 可以简单贴出来, 不要贴密码信息

          Reply
      2. admin Post author

        确保 atlas 能连接到读库上, 连不上读库也会出这个问题, 你可以手工测测连接.

        Reply
        1. 苏幕遮

          博主,我向开发询问他们查询数据库的语句,确认他们使用了事务,会不会是这个原因?

          Reply
          1. admin Post author

            程序里显式指明了事务都会被 atlas 转到后端的 master 里, 不会路由到 slave,  你自己可以手动连接 atlas, 不要使用事务, 看看有没有 sql 转到 slave 里.  另外, 程序是否使用事务和你上面的报错没多少关系. 你在上面列出的错更像是 atlas 连不上 slave.

  2. 苏幕遮

    因为我在模块端测试atlas是可以登陆的。如果可以的话,方便交换一下QQ或者电邮么?我可以截图到本地传给你。

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *