写点什么

Java 服务总在半夜挂,背后的真相竟然是... | 京东云技术团队

  • 2023-10-25
    北京
  • 本文字数:12703 字

    阅读完需:约 42 分钟

Java服务总在半夜挂,背后的真相竟然是... | 京东云技术团队

写在前面

最近有用户反馈测试环境 Java 服务总在凌晨 00:00 左右挂掉,用户反馈 Java 服务没有定时任务,也没有流量突增的情况,Jvm 配置也合理,莫名其妙就挂了

问题排查

问题复现

为了复现该问题,写了个 springboot 的 demo 部署在测试环境,其中 demo 里只做了 hello world 功能,应用类型为web_tomcat (war包部署),基础镜像是base_tomcat/java-centos6-jdk18-60-tom8050-ngx197,镜像使用的 Java 版本是1.8.0_60,有了上次MySQL被kill的经验,盲猜是 linux limit 惹的祸,因此将打好的镜像分别部署了两批不同的机器,果不其然,新机器当晚挂掉了,老机器服务正常



看一下挂掉的 limit 设置


排查过程

Java 进程会受到 limits 影响?

按理说 Java 进程是不会受到系统 limit open files(系统最大句柄数)影响的,但是为了验证这个问题,我们将他修改为正常机器的值,由于 demo 是web_tomcat应用,没法修改启动脚本,因此我们通过prlimit修改 java 进程的 limit


prlimit -p 32672 --nofile=1048576 
复制代码



结果当晚 00:00 左右还是挂了,看来 open files 和 java 进程挂掉没关系,看dmesg也没发现什么问题


Java 版本过低导致内存分配不合理?

通过寻求 jdos 研发组的帮助,jdos 研发组的同学认为是 java 版本的问题,低版本可能没有限制住申请的内存大小,具体原因如下


https://blog.softwaremill.com/docker-support-in-new-java-8-finally-fd595df0ca54?gi=a0cc6736ed14


异常机器 java 内存情况



正常机器 java 内存情况



按照这个文档描述,使用 docker cgroups 限制内存可能会导致 JVM 进程被终止,原因是 Java 读取的还是宿主机的 CPU,而不是 docker cgroups 限制的 CPU,高版本的 Java 解决了这个问题,文档解决方案截图如下:



对此我们表示怀疑,因为我们的程序里设置了 JVM 参数



保持着试一试的心态,我们增加了一个实验组,实验组使用的 Java 版本是11.0.8


结果当晚实验组的 Java 进程还是死了,看来和 Java 版本也没关系


容器上存在定时任务导致的?

由于基础镜像是 jdos 官方提供的镜像,所以之前从来没有怀疑过是定时任务的问题,但是现在别无他法了,检查下容器的定时任务


虽然有定时任务,但是这个执行的时间点和 Java 挂掉的时间对不上,为此我们决定删除定时任务试试


结果当晚 Java 进程还是挂了,并且这次有 dmesg 的日志,发现 Java 被 kill 的同时 crond 也被 kill 了,被 kill 的原因是 crond 内存过高导致 oom



难道还有系统级 cron 任务?于是查了一下/etc/crontab,发现果然还有 cron 任务(这是谁打的镜像!!!)



这个时间点和 Java 进程挂掉的时间点吻合,但是问题来了,执行的任务并没有logrotate.sh这个脚本,应该不会出现问题才对



到底是不是定时任务的问题,我们修改下 cron 的时间验证下,调整时间为中午 11:00,验证下 Java 进程是否会挂,同时使用strace打印进程 trace log



果然 Java 进程在中午 11.00 挂了,看来真的是 cron 任务导致的,让我们一起看一下 strace


19:59:01 close(3)                        = 019:59:01 stat("/etc/pam.d", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 019:59:01 open("/etc/pam.d/crond", O_RDONLY) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=293, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "#\n# The PAM configuration file f"..., 4096) = 29319:59:01 open("/lib64/security/pam_access.so", O_RDONLY) = 519:59:01 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\17\0\0\0\0\0\0"..., 832) = 83219:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=18552, ...}) = 019:59:01 mmap(NULL, 2113800, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd76932200019:59:01 mprotect(0x7fd769325000, 2097152, PROT_NONE) = 019:59:01 mmap(0x7fd769525000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x3000) = 0x7fd76952500019:59:01 close(5) = 019:59:01 open("/etc/ld.so.cache", O_RDONLY) = 519:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=16203, ...}) = 019:59:01 mmap(NULL, 16203, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fd7707f800019:59:01 close(5) = 019:59:01 open("/lib64/libnsl.so.1", O_RDONLY) = 519:59:01 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p@\0\0\0\0\0\0"..., 832) = 83219:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=113432, ...}) = 019:59:01 mmap(NULL, 2198192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd76910900019:59:01 mprotect(0x7fd76911f000, 2093056, PROT_NONE) = 019:59:01 mmap(0x7fd76931e000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x15000) = 0x7fd76931e00019:59:01 mmap(0x7fd769320000, 6832, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fd76932000019:59:01 close(5) = 019:59:01 mprotect(0x7fd76931e000, 4096, PROT_READ) = 019:59:01 mprotect(0x7fd769525000, 4096, PROT_READ) = 019:59:01 munmap(0x7fd7707f8000, 16203) = 019:59:01 open("/etc/pam.d/password-auth", O_RDONLY) = 519:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)                     = 0x7fd77080300019:59:01 read(5, "#%PAM-1.0\n# This file is auto-ge"..., 4096) = 69219:59:01 open("/lib64/security/pam_unix.so", O_RDONLY) = 619:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240&\0\0\0\0\0\0"..., 832) = 83219:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=51960, ...}) = 019:59:01 mmap(NULL, 2196352, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd768ef000019:59:01 mprotect(0x7fd768efc000, 2093056, PROT_NONE) = 019:59:01 mmap(0x7fd7690fb000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0xb000) = 0x7fd7690fb00019:59:01 mmap(0x7fd7690fd000, 45952, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fd7690fd00019:59:01 close(6)                       = 019:59:01 mprotect(0x7fd7690fb000, 4096, PROT_READ) = 019:59:01 read(5, "", 4096)              = 019:59:01 close(5) = 019:59:01 munmap(0x7fd770803000, 4096) = 019:59:01 open("/lib64/security/pam_loginuid.so", O_RDONLY) = 519:59:01 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220\t\0\0\0\0\0\0"..., 832) = 83219:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=10240, ...}) = 019:59:01 mmap(NULL, 2105480, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd768ced00019:59:01 mprotect(0x7fd768cef000, 2093056, PROT_NONE) = 019:59:01 mmap(0x7fd768eee000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x1000) = 0x7fd768eee00019:59:01 close(5) = 019:59:01 mprotect(0x7fd768eee000, 4096, PROT_READ) = 019:59:01 open("/etc/pam.d/password-auth", O_RDONLY) = 519:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080300019:59:01 read(5, "#%PAM-1.0\n# This file is auto-ge"..., 4096) = 69219:59:01 open("/lib64/security/pam_keyinit.so", O_RDONLY) = 619:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\10\0\0\0\0\0\0"..., 832) = 83219:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=10224, ...}) = 019:59:01 mmap(NULL, 2105488, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0)                      = 0x7fd768aea00019:59:01 mprotect(0x7fd768aec000, 2093056, PROT_NONE)                     = 019:59:01 mmap(0x7fd768ceb000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x1000) = 0x7fd768ceb00019:59:01 close(6) = 019:59:01 mprotect(0x7fd768ceb000, 4096, PROT_READ) = 019:59:01 open("/lib64/security/pam_limits.so", O_RDONLY) = 619:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\20\0\0\0\0\0\0"..., 832) = 83219:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=18600, ...}) = 019:59:01 mmap(NULL, 2113848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7688e500019:59:01 mprotect(0x7fd7688e9000, 2093056, PROT_NONE) = 019:59:01 mmap(0x7fd768ae8000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x3000) = 0x7fd768ae800019:59:01 close(6) = 019:59:01 mprotect(0x7fd768ae8000, 4096, PROT_READ) = 019:59:01 open("/lib64/security/pam_succeed_if.so", O_RDONLY) = 619:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\v\0\0\0\0\0\0"..., 832) = 83219:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=14384, ...}) = 019:59:01 mmap(NULL, 2109624, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7686e100019:59:01 mprotect(0x7fd7686e4000, 2093056, PROT_NONE) = 019:59:01 mmap(0x7fd7688e3000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x2000) = 0x7fd7688e300019:59:01 close(6) = 019:59:01 mprotect(0x7fd7688e3000, 4096, PROT_READ)                       = 019:59:01 read(5, "", 4096) = 019:59:01 close(5)                     = 019:59:01 munmap(0x7fd770803000, 4096) = 019:59:01 open("/etc/pam.d/password-auth", O_RDONLY)                      = 519:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)                      = 0x7fd77080300019:59:01 read(5, "#%PAM-1.0\n# This file is auto-ge"..., 4096) = 69219:59:01 open("/lib64/security/pam_env.so", O_RDONLY) = 619:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\r\0\0\0\0\0\0"..., 832) = 83219:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=18592, ...}) = 019:59:01 mmap(NULL, 2113776, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0)                       = 0x7fd7684dc00019:59:01 mprotect(0x7fd7684e0000, 2093056, PROT_NONE) = 019:59:01 mmap(0x7fd7686df000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x3000) = 0x7fd7686df00019:59:01 close(6) = 019:59:01 mprotect(0x7fd7686df000, 4096, PROT_READ)                     = 019:59:01 open("/lib64/security/pam_deny.so", O_RDONLY) = 619:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\5\0\0\0\0\0\0"..., 832) = 83219:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=5952, ...}) = 019:59:01 mmap(NULL, 2101272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0)                       = 0x7fd7682da00019:59:01 mprotect(0x7fd7682db000, 2093056, PROT_NONE) = 019:59:01 mmap(0x7fd7684da000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0)                      = 0x7fd7684da00019:59:01 close(6) = 019:59:01 mprotect(0x7fd7684da000, 4096, PROT_READ) = 019:59:01 read(5, "", 4096) = 019:59:01 close(5) = 019:59:01 munmap(0x7fd770803000, 4096) = 019:59:01 read(3, "", 4096)             = 019:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096)                      = 019:59:01 open("/etc/pam.d/other", O_RDONLY)                      = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=154, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)   = 0x7fd77080400019:59:01 read(3, "#%PAM-1.0\nauth     required     "..., 4096) = 15419:59:01 read(3, "", 4096) = 019:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC)   = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 105719:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 uname({sys="Linux", node="host-11-159-73-176", ...}) = 019:59:01 open("/etc/security/access.conf", O_RDONLY) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=4620, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "# Login access control table.\n#\n"..., 4096) = 409619:59:01 read(3, " should get access from ipv4 net"..., 4096) = 52419:59:01 read(3, "", 4096) = 019:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 getuid() = 019:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)  = 0x7fd77080400019:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 105719:59:01 close(3)                       = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 geteuid() = 019:59:01 open("/etc/shadow", O_RDONLY|O_CLOEXEC) = 319:59:01 fstat(3, {st_mode=S_IFREG, st_size=901, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "root:$6$4.53VPrJ$1wxMpbsWYp4VKea"..., 4096) = 90119:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096)                      = 019:59:01 socket(PF_NETLINK, SOCK_RAW, 9)                       = 319:59:01 fcntl(3, F_SETFD, FD_CLOEXEC)   = 019:59:01 readlink("/proc/self/exe", "/usr/sbin/crond", 4096) = 1519:59:01 sendto(3, "p\0\0\0M\4\5\0\1\0\0\0\0\0\0\0op=PAM:accountin"..., 112, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12)                      = 11219:59:01 poll([{fd=3, events=POLLIN}], 1, 500)   = 1 ([{fd=3, revents=POLLIN}])19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\1\0\0\0\227\7\0\0\0\0\0\0p\0\0\0M\4\5\0\1\0\0\0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 3619:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\1\0\0\0\227\7\0\0\0\0\0\0p\0\0\0M\4\5\0\1\0\0\0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 3619:59:01 close(3) = 019:59:01 open("/etc/security/pam_env.conf", O_RDONLY) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=2980, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "#\n# This is the configuration fi"..., 4096) = 298019:59:01 read(3, "", 4096) = 019:59:01 close(3)                      = 019:59:01 munmap(0x7fd770804000, 4096)                       = 019:59:01 open("/etc/environment", O_RDONLY)   = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)               = 0x7fd77080400019:59:01 read(3, "", 4096) = 019:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 socket(PF_NETLINK, SOCK_RAW, 9) = 319:59:01 fcntl(3, F_SETFD, FD_CLOEXEC) = 019:59:01 sendto(3, "p\0\0\0O\4\5\0\2\0\0\0\0\0\0\0op=PAM:setcred a"..., 112, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12)                       = 11219:59:01 poll([{fd=3, events=POLLIN}], 1, 500)   = 1 ([{fd=3, revents=POLLIN}])19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\2\0\0\0\227\7\0\0\0\0\0\0p\0\0\0O\4\5\0\2\0\0\0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 3619:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\2\0\0\0\227\7\0\0\0\0\0\0p\0\0\0O\4\5\0\2\0\0\0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 3619:59:01 close(3) = 019:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 105719:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 open("/proc/self/loginuid", O_WRONLY|O_TRUNC|O_NOFOLLOW)        = 319:59:01 write(3, "0", 1) = 119:59:01 close(3) = 019:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 105719:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 getuid() = 019:59:01 getgid() = 019:59:01 keyctl(0, 0xfffffffd, 0, 0, 0) = 49646638519:59:01 keyctl(0, 0xfffffffb, 0, 0, 0x30) = 78570213219:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 105719:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 getrlimit(RLIMIT_CPU, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_FSIZE, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_DATA, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_CORE, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_RSS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_NPROC, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_NOFILE, {rlim_cur=1073741816, rlim_max=1073741816}) = 019:59:01 getrlimit(RLIMIT_MEMLOCK, {rlim_cur=64*1024, rlim_max=64*1024}) = 019:59:01 getrlimit(RLIMIT_AS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_LOCKS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_SIGPENDING, {rlim_cur=883632, rlim_max=883632}) = 019:59:01 getrlimit(RLIMIT_MSGQUEUE, {rlim_cur=800*1024, rlim_max=800*1024}) = 019:59:01 getrlimit(RLIMIT_NICE, {rlim_cur=0, rlim_max=0}) = 019:59:01 getrlimit(RLIMIT_RTPRIO, {rlim_cur=0, rlim_max=0}) = 019:59:01 getpriority(PRIO_PROCESS, 0) = 2019:59:01 open("/etc/security/limits.conf", O_RDONLY) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1835, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "# /etc/security/limits.conf\n#\n#E"..., 4096) = 183519:59:01 read(3, "", 4096) = 019:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 open("/etc/security/limits.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 319:59:01 getdents(3, /* 3 entries */, 32768) = 8819:59:01 open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY)                       = 519:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=26060, ...}) = 019:59:01 mmap(NULL, 26060, PROT_READ, MAP_SHARED, 5, 0) = 0x7fd7707f500019:59:01 close(5)  = 019:59:01 getdents(3, /* 0 entries */, 32768) = 019:59:01 close(3) = 019:59:01 open("/etc/security/limits.d/90-nproc.conf", O_RDONLY) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=193, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "# Default limit for number of us"..., 4096) = 19319:59:01 read(3, "", 4096)              = 019:59:01 close(3)                       = 019:59:01 munmap(0x7fd770804000, 4096)   = 019:59:01 setrlimit(RLIMIT_NPROC, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 setpriority(PRIO_PROCESS, 0, 0) = 019:59:01 getuid() = 019:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 105719:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096)                     = 019:59:01 socket(PF_NETLINK, SOCK_RAW, 9)                      = 319:59:01 fcntl(3, F_SETFD, FD_CLOEXEC)                      = 019:59:01 sendto(3, "t\0\0\0Q\4\5\0\3\0\0\0\0\0\0\0op=PAM:session_o"..., 116, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 11619:59:01 poll([{fd=3, events=POLLIN}], 1, 500) = 1 ([{fd=3, revents=POLLIN}])19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\3\0\0\0\227\7\0\0\0\0\0\0t\0\0\0Q\4\5\0\3\0\0\0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 3619:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\3\0\0\0\227\7\0\0\0\0\0\0t\0\0\0Q\4\5\0\3\0\0\0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 3619:59:01 close(3) = 019:59:01 setgid(0) = 019:59:01 open("/proc/sys/kernel/ngroups_max", O_RDONLY) = 319:59:01 read(3, "65536\n", 31)         = 619:59:01 close(3)                       = 019:59:01 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 319:59:01 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)19:59:01 close(3) = 019:59:01 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 319:59:01 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110)                       = -1 ENOENT (No such file or directory)19:59:01 close(3) = 019:59:01 open("/etc/group", O_RDONLY|O_CLOEXEC) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=497, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 lseek(3, 0, SEEK_CUR) = 019:59:01 read(3, "root:x:0:\nbin:x:1:bin,daemon\ndae"..., 4096) = 49719:59:01 read(3, "", 4096)              = 019:59:01 close(3)                       = 019:59:01 munmap(0x7fd770804000, 4096)                     = 019:59:01 setgroups(1, [0]) = 019:59:01 setreuid(0, 4294967295) = 019:59:01 rt_sigaction(SIGCHLD, {SIG_DFL, [CHLD], SA_RESTORER|SA_RESTART, 0x7fd76fa316a0}, {0x558826e03b80, [], SA_RESTORER|SA_RESTART, 0x7fd76fa316a0}, 8) = 019:59:01 pipe([3, 5])                   = 019:59:01 pipe([6, 7])                   = 019:59:01 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd7707fca70) = 194619:59:01 gettid()                     = 194319:59:01 open("/proc/self/task/1943/attr/exec", O_RDWR) = 819:59:01 write(8, NULL, 0) = -1 EINVAL (Invalid argument)19:59:01 close(8) = 019:59:01 close(3) = 019:59:01 close(7) = 019:59:01 close(5) = 019:59:01 fcntl(6, F_GETFL)                       = 0 (flags O_RDONLY)19:59:01 fstat(6, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)                     = 0x7fd77080400019:59:01 lseek(6, 0, SEEK_CUR)                     = -1 ESPIPE (Illegal seek)19:59:01 read(6, "/bin/bash: ./logrotate.sh: \346\262\241\346\234"..., 4096) = 5519:59:01 uname({sys="Linux", node="host-11-159-73-176", ...}) = 019:59:01 getrlimit(RLIMIT_NOFILE, {rlim_cur=1073741816, rlim_max=1073741816}) = 019:59:01 mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd6682da00019:59:01 --- SIGCHLD (Child exited) @ 0 (0) ---19:59:06 +++ killed by SIGKILL +++
复制代码


可以看到最后用 mmap 一次分配了 4G 内存,然后就被 kill 了。


mmap前调用了getrlimit,和上次MySQL的问题一样,都是根据系统资源限制来分配内存


为了确定就是 cron 导致 java 挂掉的元凶,我们把 cron 进程手动 kill 掉,这样就不会执行定时任务了,这次我们在验证下 Java 进程是否会挂掉


果不其然,Java 进程并没有挂掉,看来真的是 cron 任务导致的

高版本 CentOS 是否也会出现类似问题?

按理说 oom killer 应该只 kill 掉占用内存最高的才对,Java 进程占用内存又不是最高的,高版本的 CentOS 系统 oom killer 策略会不会有升级?



让我们来一起验证下高版本的 CentOS 系统是否有这个问题


当前镜像的 CentOS 版本是CentOS release 6.6 (Final),为了验证高版本的 CentOS 是否也有类似的问题,我们将增加两个实验组,分别升级基础镜像至CentOS release 6.10 (Final)CentOS Linux release 7.9.2009 (Core),也添加相同的 cron 任务


结果发现CentOS release 6.10 (Final)CentOS Linux release 7.9.2009 (Core)都没有 kill 掉 Java 进程,只 kill 掉了 cron 的子进程


结论


由于容器limit open files(系统最大句柄数)设置不合理导致 cron 执行任务时使容器内存飙升,存在内存溢出的风险,linux 由于保护机制会 kill 掉占用内存高的进程,导致 cron 子任务进程和 Java 进程一起被 kill(但是问题来了,这个 jdos 基础镜像为什么会执行一个完全不存在的 shell 脚本,而且还是执行两次???),高版本的 CentOS 系统不会 kill java 进程,猜测不同版本的 CentOS 的 kill 选择策略略有不同

问题分析

Cron 任务执行逻辑

在 Linux 中,crontab 工具是由 croine 软件包提供的,让我们一起看下 cron 的执行过程



其中 child_process()执行了 cron 子进程,cron 执行子进程时会有发送 mail 的动作



cron_popen 在执行时会按照 open files(系统最大句柄数)清除内存




综上,cron oom 的原因找到了,是由于 open files 设置过大且 cron 任务没有标准输出,导致执行了发送 mail 逻辑,而清除的内存大小超出了容器本身内存的大小,导致 oom。


croine 1.5.4 版本之后修复了该问题,如果想查看当前容器 croine 版本可执行如下命令:


rpm -q cronie
复制代码


Linux 内核 OOM killer 机制

Linux 内核有个机制叫 OOM killer(Out Of Memory killer),该机制会监控那些占用内存过大,尤其是瞬间占用内存很快的进程,然后防止内存耗尽而自动把该进程杀掉。内核检测到系统内存不足、挑选并杀掉某个进程的过程可以参考内核源代码 linux/mm/oom_kill.c,当系统内存不足的时候,out_of_memory()被触发,然后调用select_bad_process()选择一个”bad”进程杀掉。


以下是一些主要的进程选择策略:


  1. 内存使用情况:OOM Killer 首先倾向于选择占用内存最多的进程,因为终止这些进程可以释放最多的内存。

  2. OOM 分数:每个进程都有一个 OOM 分数,该分数是基于其内存使用情况和其他因素计算出来的。OOM Killer 倾向于终止 OOM 分数最高的进程。

  3. 进程优先级:在选择要终止的进程时,OOM Killer 通常会避免终止对系统至关重要的系统进程。这些进程通常具有较高的优先级,因此它们更不容易成为终止目标。

  4. 进程资源需求:OOM Killer 还会考虑进程的资源需求。它倾向于终止那些请求较少资源的进程,以最小化影响其他进程的运行。

  5. 进程属性:某些进程可能被标记为不可终止,例如通过设置/proc/[PID]/oom_score_adj 的值来调整 OOM 分数。这些进程通常不容易被 OOM Killer 终止。


注:不同版本的 Linux oom killer 机制可能会存在一些差异

解决方案

使用高版本稳定的 CentOS 系统,如果业务无法升级 CentOS,则需要设置合理的limit open files数量,application_worker 类型应用可以在启动脚本中手动修改 limit,web_tomcat 类型应用没法修改启动脚本,可以选择 kill 掉 cron 进程或删除系统 cron 任务,也可以手动升级cronie的版本至1.5.7-5

写在后面

open files这个坑很大,栽这个坑两次了,大家一定要检查自己服务对应容器的 CentOS 版本和 limit 设置是否合理,本次案例发生在测试环境,尚不会引起事故,如果在生产出现类似情况,后果不堪设想


由于测试环境新增的这批机器都存在这个问题,我们团队已经联系机器提供方上报了该问题,后续这批机器会由提供方统一修改系统最大句柄数,如果当前问题影响到了业务的正常使用,可以临时删除容器中/etc/crontab中的任务

参考文献

https://cloud.tencent.com/developer/article/1183262


https://github.com/cronie-crond/cronie


作者:京东零售 杨云龙

来源:京东云开发者社区 转载请注明来源

发布于: 刚刚阅读数: 5
用户头像

拥抱技术,与开发者携手创造未来! 2018-11-20 加入

我们将持续为人工智能、大数据、云计算、物联网等相关领域的开发者,提供技术干货、行业技术内容、技术落地实践等文章内容。京东云开发者社区官方网站【https://developer.jdcloud.com/】,欢迎大家来玩

评论

发布
暂无评论
Java服务总在半夜挂,背后的真相竟然是... | 京东云技术团队_Java_京东科技开发者_InfoQ写作社区