Atlas使用问题汇总

1.安装问题
源码编译Atlas新版
注:2.1版的编码问题解决了, /master/方式通过php,java等完成访问(命令行方式不能转到master主机上).

yum install -y libevent-devel lua-devel openssl-devel flex mysql-devel
yum install gcc.x86_64 gcc-c++.x86_64

glibc install:
  391  wget http://pkgs.fedoraproject.org/repo/pkgs/glib2/glib-2.32.4.tar.xz/bf84fefd9c1a5b5a7a38736f4ddd674a/glib-2.32.4.tar.xz
  393  tar xf glib-2.32.4.tar.xz 
  395  cd glib-2.32.4
  397  mkdir /usr/local/glib/
  398  ./configure --prefix=/usr/local/glib
  401  yum install libffi-devel.x86_64
  403  ./configure --prefix=/usr/local/glib
  405  make
  406  make install 

checking for GLIB... configure: error: Package requirements (glib-2.0 >= 2.32.0) were not met:
checking for GMODULE... configure: error: Package requirements (gmodule-2.0 >= 2.32.0) were not met:

[root@atlas Atlas-master]# cat bootstrap.sh

#!/bin/sh 
base=$(cd "$(dirname "$0")"; pwd)
cd $base
PKG_CONFIG_PATH=/usr/local/glib/lib/pkgconfig ./configure --with-mysql=/usr --prefix=/opt/mysql-proxy 

#make && make install

2.延迟处理
见: https://github.com/Qihoo360/Atlas/issues/8#issuecomment-38153303
使用中间件做read负载, 最核心的是解决延迟问题,应用端可以通过/master/方式强制指定从master获取信息,不过这不能从本质上解决问题,使用Atlas需要考虑到这点, 提高主机的IO性能,加大memory可以缓解延迟症状,但依旧不能避免延迟的出现,尤其是读多写少的应用。

Atlas没有做主从状态的检测, 在SQL_THREAD为No情况下,sql的执行失败(在延迟情况下不管是空结果集还是表不存在)都不会使slave在backends下线,不过可以通过管理接口来实现offline出问题的slave,添加命令到监控项可实现自动下线。比如:

mysql -h 10.3.254.106 -P 2345 -u admin -pxxxxxxxx -e 'SET OFFLINE 3'

下线功能或许可能在未来的版本实现,目前可以通过daemon程序来频繁检测slave的延迟情况, 延迟较大(如检测两次,每次都大于10s)则下线slave。

https://github.com/arstercz/Atlas_auto_setline

# atlas auto set line.
* * * * * cd /web/scripts/atlas_auto; perl atlas_auto_setline.pl --conf=db.conf --verbose --setline >>setline.log 2>&1

3.thread超时退出
MySQL为percona 5.1.34版本

mysql root@[localhost:s3306 information_schema] > select command,time,state,info from processlist order by time desc limit 10;
+---------+--------+----------------------------------+------+
| command | time   | state                            | info |
+---------+--------+----------------------------------+------+
| Connect | 111463 | Waiting for master to send event | NULL |
| Sleep   |   1473 |                                  | NULL |
| Sleep   |   1473 |                                  | NULL |
| Sleep   |   1356 |                                  | NULL |
| Sleep   |   1356 |                                  | NULL |
| Sleep   |   1342 |                                  | NULL |
| Sleep   |   1242 |                                  | NULL |
| Sleep   |   1242 |                                  | NULL |
| Sleep   |   1241 |                                  | NULL |
| Sleep   |   1241 |                                  | NULL |
+---------+--------+----------------------------------+------+
10 rows in set (0.00 sec)

mysql root@[localhost:s3306 information_schema] > select count(*) from processlist;                                           
+----------+
| count(*) |
+----------+
|       76 |
+----------+
1 row in set (0.00 sec)

mysql root@[localhost:s3306 information_schema] > select command,time,state,info from processlist order by time desc limit 10;
+---------+--------+----------------------------------+------+
| command | time   | state                            | info |
+---------+--------+----------------------------------+------+
| Connect | 111490 | Waiting for master to send event | NULL |
| Sleep   |   1383 |                                  | NULL |
| Sleep   |   1383 |                                  | NULL |
| Sleep   |   1369 |                                  | NULL |
| Sleep   |   1269 |                                  | NULL |
| Sleep   |   1269 |                                  | NULL |
| Sleep   |   1268 |                                  | NULL |
| Sleep   |   1268 |                                  | NULL |
| Sleep   |   1268 |                                  | NULL |
| Sleep   |   1252 |                                  | NULL |
+---------+--------+----------------------------------+------+
10 rows in set (0.00 sec)

mysql root@[localhost:s3306 information_schema] > select count(*) from processlist;                                           
+----------+
| count(*) |
+----------+
|       74 |
+----------+
1 row in set (0.00 sec)

为interactive_timeout参数的值,见:

| interactive_timeout        | 1500  |
| wait_timeout               | 600   |
On thread startup, the session wait_timeout value is initialized from the global wait_timeout value or from the global interactive_timeout value, depending on the type of client (as defined by the CLIENT_INTERACTIVE connect option to mysql_real_connect()). See also interactive_timeout.

即Mysql 初始化wait_timeout的值根据CLIENT_INTERACTIVE(应用端交互类型)选取是wait_timeout还是interactive_timeout。 php应用采用mysql_real_connect函数连接(不一定严格使用该名字的函数,只要底层使用了CLIENT_INTERACTIVE类型;比如mysqli驱动连接分为mysqli::real_connect — mysqli_real_connect — Opens a connection to a mysql server和mysqli::__construct — mysqli_connect — Open a new connection to the MySQL server两种,前者连接前需要初始化一个对象,后者为单纯的新建连接,这个时候的初始值为wait_timeout),则选取interactive_timeout值作为初始值见(interactive_timeout选项): An interactive client is defined as a client that uses the CLIENT_INTERACTIVE option to mysql_real_connect()。

4.resource limit问题
atlas有自己的连接池,会吃掉很多CPU, php应用端改用短链接来连接atlas, 这时候atlas对php发送来的sql只负责验证和转发的操作,后端DB的连接由atlas自己管理,未使用的连接线程进行剔除操作(DB的wait_timeout和interactive_timeout设置为300s,超时亦退出)。

2014-04-12 20:56:29: (warning) (libevent) event_del: event has no event_base set.
2014-04-12 20:56:29: (critical) last message repeated 5 times
2014-04-12 20:56:29: (critical) network-conn-pool-lua.c.144: socket() failed: Too many open files (24)
2014-04-12 20:56:29: (warning) (libevent) event_del: event has no event_base set.
2014-04-12 20:56:30: (debug) chassis-unix-daemon.c:168: 12951 returned: 12951
2014-04-12 20:56:30: (critical) chassis-unix-daemon.c:196: [angel] PID=12951 died on signal=11 (it used 16 kBytes max) ... waiting 3min before restart
如果MySQL后端的连接数也满了, 或者atlas 连不上后端的数据库都可能会报以下错误:
2014-11-13 12:21:07: (critical) network_mysqld_proto_password_scramble: assertion `20 == challenge_len' failed
2014-11-13 12:21:07: (warning) (libevent) event_del: event has no event_base set.
2014-11-13 12:21:07: (critical) 

可以临时增加MySQL connection数量:
echo -n “Max processes=SOFT_LIMIT:HARD_LIMIT” > /proc/pidof mysqld/limits

关于Too many open files错误,可能由两种情况引起:
一、php长连接连接到atlas后,每个线程占用一个FD,直到超出系统资源限制而出现too many错误;
二、php应用端发送到atlas的sql过多,大量并发的情况下,linevent维护的队列过多,每个event吃一个FD,超出系统资源限制引起too many错误;

避免too many错误,增加用户的ulimit值加大FD的使用量,可增加系统ulimit 资源到 ~/.bash_profile文件或/etc/security/limits.conf文件:

# cat .bash_profile 
# .bash_profile
...
...
export PATH
ulimit -n 16384

5. OOM问题
mysql-proxy进程占用系统资源过多, 由于系统评分机制的原因, atlas进程被系统杀掉, 比如以下错误:

May  3 09:55:31 cz kernel: Out of memory: Kill process 45330 (mysql-proxy) score 1 or sacrifice child
May  3 09:55:31 cz kernel: Killed process 45331, UID 0, (mysql-proxy) total-vm:4917192kB, anon-rss:177720kB, file-rss:40kB
May  3 09:55:31 cz kernel: mysql-proxy: page allocation failure. order:0, mode:0x201da
May  3 09:55:31 cz kernel: Pid: 45331, comm: mysql-proxy Not tainted 2.6.32-431.5.1.el6.x86_64 #1
May  3 09:55:31 cz kernel: Call Trace:
May  3 09:55:31 cz kernel: [] ? __alloc_pages_nodemask+0x757/0x8d0
May  3 09:55:31 cz kernel: [] ? alloc_pages_current+0xaa/0x110
May  3 09:55:31 cz kernel: [] ? __page_cache_alloc+0x87/0x90
May  3 09:55:31 cz kernel: [] ? find_get_page+0x1e/0xa0
May  3 09:55:31 cz kernel: [] ? filemap_fault+0x1a7/0x500
May  3 09:55:31 cz kernel: [] ? __do_fault+0x54/0x530
May  3 09:55:31 cz kernel: [] ? handle_pte_fault+0xf7/0xb00
May  3 09:55:31 cz kernel: [] ? thread_return+0x4e/0x76e
May  3 09:55:31 cz kernel: [] ? __remove_hrtimer+0x3b/0xb0
May  3 09:55:31 cz kernel: [] ? lock_hrtimer_base+0x31/0x60
May  3 09:55:31 cz kernel: [] ? hrtimer_try_to_cancel+0x3f/0xd0
May  3 09:55:31 cz kernel: [] ? handle_mm_fault+0x22a/0x300
May  3 09:55:31 cz kernel: [] ? __do_page_fault+0x138/0x480
May  3 09:55:31 cz kernel: [] ? ep_poll+0x12e/0x330
May  3 09:55:31 cz kernel: [] ? default_wake_function+0x0/0x20
May  3 09:55:31 cz kernel: [] ? do_page_fault+0x3e/0xa0
May  3 09:55:31 cz kernel: [] ? page_fault+0x25/0x30

鉴于这种情况, 即便atlas做了高可用相关的架构, 在切换到另一台atlas之后也可能很快的被系统 OOM 掉, 可以临时禁止atlas进程参与系统评分, 尽量避免被系统杀掉.

for x in `pidof mysql-proxy`; do  echo '-17' > /proc/$x/oom_adj; done
  • 苏幕遮

    博主,你好,我遇到你文中说到的几个问题,但是没有得到解决,前来打扰您。

    公司使用的阿里云的数据库,然后搭配自己搭建的atlas,现在情况是atlas和数据库的链接是没问题的。但是在阿里云的监控端发现读库的“当前连接数”始终是0,而主库是有数值的。打开atals查看日志就看到这样的东西:

    2017-03-28 10:41:04: (message) chassis-unix-daemon.c:138: [angel] we try to keep PID=31136 alive
    2017-03-28 10:41:04: (message) mysql-proxy 0.8.2 started – instance: openhls
    2017-03-28 10:41:04: (message) proxy listening on port 0.0.0.0:3307
    2017-03-28 10:41:04: (message) added read/write backend: rm-bp140cd7xf48273m8.mysql.rds.aliyuncs.com:3306
    2017-03-28 10:41:04: (message) added read-only backend: rr-bp1x2v62r2558089s.mysql.rds.aliyuncs.com:3306
    2017-03-28 10:41:04: (message) chassis-event-thread.c:235: starting 8 threads
    2017-03-28 10:41:05: (warning) (libevent) event_del: event has no event_base set.
    2017-03-28 10:41:36: (warning) last message repeated 64 times
    2017-03-28 10:41:36: (warning) (libevent) event_del: event has no event_base set.
    2017-03-28 10:42:07: (warning) last message repeated 61 times
    2017-03-28 10:42:07: (warning) (libevent) event_del: event has no event_base set.
    2017-03-28 10:42:38: (warning) last message repeated 61 times
    2017-03-28 10:42:38: (warning) (libevent) event_del: event has no event_base set.
    2017-03-28 10:43:09: (warning) last message repeated 69 times
     

    看到已经发现了两个数据库,但是却不断地在刷(warning) (libevent) event_del: event has no event_base set.我把读库和主库的wait_timeout和interactive_timeout设置为300s(默认值是86400和7200),但是效果没变,请问怎么破?

    • admin

      后端的连接数满了吗? 参考文章的 resource limit 部分.

      • 苏幕遮

        阿里云2G的mysql 5.6最大连接数600(这个不能改,只能通过提升配置来改变),但是我们的连接数远远不足,还不到40…

        您这个只能传网络图片,无法上传本地图片…

        • admin

          atlas 的配置有吗? 可以简单贴出来, 不要贴密码信息

      • admin

        确保 atlas 能连接到读库上, 连不上读库也会出这个问题, 你可以手工测测连接.

        • 苏幕遮

          博主,我向开发询问他们查询数据库的语句,确认他们使用了事务,会不会是这个原因?

          • admin

            程序里显式指明了事务都会被 atlas 转到后端的 master 里, 不会路由到 slave,  你自己可以手动连接 atlas, 不要使用事务, 看看有没有 sql 转到 slave 里.  另外, 程序是否使用事务和你上面的报错没多少关系. 你在上面列出的错更像是 atlas 连不上 slave.

  • 苏幕遮

    因为我在模块端测试atlas是可以登陆的。如果可以的话,方便交换一下QQ或者电邮么?我可以截图到本地传给你。

    • admin

      可以 qq 联系 510621028