写点什么

JFR 定位线上问题实例 - JFR 导致的雪崩问题定位与解决

用户头像
AI乔治
关注
发布于: 2021 年 01 月 13 日
JFR定位线上问题实例 - JFR导致的雪崩问题定位与解决

最近发现一个应用,使用默认的 JFR 配置,发生了类似于雪崩的现象,这里记录下,引以为戒。

Key takeaways

  1. JFR 的线程堆栈 dump 采集默认周期是 everyChunk,也就是每次新建一个 Chunk 就会采集一次。默认每个 Chunk 大小为 12M,在线程比较多的时候,堆栈 dump 可能大于 12M,导致 JFR 一直切换新的 Chunk,然后又触发 evenryChunk 导致无限循环而雪崩。

  2. 对于 JFR 默认事件采集配置(位于 JDK 目录/lib/jfr/default.jfc),每个采集周期和 Chunk 相关的,都要谨慎处理,最好周期通过固定时间写死,例如每一分钟等等,不要使用 Chunk 作为指标,防止上面这个问题。

  3. 只有 Running 的线程才会进行 CPU 上下文切换,创建很多线程,但是同一时间有任务的线程很少(Running 线程比较少),那么线程切换也不会很频繁。等待任务的线程基本不占用 CPU 资源。

背景介绍

线上某个线程数非常多的应用,线程是空闲状态,有任务提交过来,但是过了 10s,任务才开始执行,并且这一段时间有很多类似的情况,例如:

2020-12-01 13:07:34.515  INFO [微服务名称,298ee1f359f67ff7,298ee1f359f67ff7] [19] [线程全局唯一名称:调度线程] 任务提交到线程池
.....
2020-12-01 13:07:43.623 INFO [微服务名称,298ee1f359f67ff7,e881ba0d86fbc4fc] [19] [线程全局唯一名称:24684780-0] 开始处理2020-12-01 13:07:44.159 INFO [微服务名称,298ee1f359f67ff7,e881ba0d86fbc4fc] [19] [线程全局唯一名称:24684780-0] 处理完成
复制代码

查看 24684780-0 这个线程池的统计当时的队列是空的,所有线程都是等待任务的状态。排除了有其他任务占满核心线程池导致等待的情况。

线程一直空闲,但是却不能接受任务处理,表示这个线程可能一直处于需要进入安全点 SafePoint 的场景。查看 GC,当时的 GC 是正常的,证明并不是 GC 引起的需要进入安全点



查看当时的 safepoint 日志(我们线上的进程开启了 safepoint 日志,参数:-Xlog:safepoint=debug:file=safepoint.log:utctime,level,tags:filecount=50,filesize=100M),发现在这期间内,有很多很多的进入 safepoint 的处理(我们只看了进入安全点(Entering safepoint region)这一行):

[2020-12-01T13:07:32.290+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation[2020-12-01T13:07:32.335+0000][info ][safepoint] Entering safepoint region: PrintThreads[2020-12-01T13:07:32.635+0000][info ][safepoint] Entering safepoint region: PrintJNI[2020-12-01T13:07:32.655+0000][info ][safepoint] Entering safepoint region: FindDeadlocks[2020-12-01T13:07:32.853+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint[2020-12-01T13:07:32.923+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation[2020-12-01T13:07:32.968+0000][info ][safepoint] Entering safepoint region: PrintThreads[2020-12-01T13:07:33.269+0000][info ][safepoint] Entering safepoint region: PrintJNI[2020-12-01T13:07:33.286+0000][info ][safepoint] Entering safepoint region: FindDeadlocks[2020-12-01T13:07:33.480+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation[2020-12-01T13:07:33.524+0000][info ][safepoint] Entering safepoint region: PrintThreads[2020-12-01T13:07:33.806+0000][info ][safepoint] Entering safepoint region: PrintJNI[2020-12-01T13:07:33.822+0000][info ][safepoint] Entering safepoint region: FindDeadlocks[2020-12-01T13:07:33.949+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint[2020-12-01T13:07:34.019+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation[2020-12-01T13:07:34.064+0000][info ][safepoint] Entering safepoint region: PrintThreads[2020-12-01T13:07:34.346+0000][info ][safepoint] Entering safepoint region: PrintJNI[2020-12-01T13:07:34.361+0000][info ][safepoint] Entering safepoint region: FindDeadlocks[2020-12-01T13:07:34.510+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation[2020-12-01T13:07:34.555+0000][info ][safepoint] Entering safepoint region: PrintThreads[2020-12-01T13:07:34.839+0000][info ][safepoint] Entering safepoint region: PrintJNI[2020-12-01T13:07:34.861+0000][info ][safepoint] Entering safepoint region: FindDeadlocks[2020-12-01T13:07:35.023+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint[2020-12-01T13:07:35.097+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation[2020-12-01T13:07:35.140+0000][info ][safepoint] Entering safepoint region: PrintThreads[2020-12-01T13:07:35.423+0000][info ][safepoint] Entering safepoint region: PrintJNI[2020-12-01T13:07:35.439+0000][info ][safepoint] Entering safepoint region: FindDeadlocks[2020-12-01T13:07:35.614+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation[2020-12-01T13:07:35.654+0000][info ][safepoint] Entering safepoint region: PrintThreads[2020-12-01T13:07:35.943+0000][info ][safepoint] Entering safepoint region: PrintJNI[2020-12-01T13:07:35.959+0000][info ][safepoint] Entering safepoint region: FindDeadlocks[2020-12-01T13:07:36.089+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint[2020-12-01T13:07:36.172+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation[2020-12-01T13:07:36.216+0000][info ][safepoint] Entering safepoint region: PrintThreads[2020-12-01T13:07:36.498+0000][info ][safepoint] Entering safepoint region: PrintJNI[2020-12-01T13:07:36.515+0000][info ][safepoint] Entering safepoint region: FindDeadlocks[2020-12-01T13:07:36.706+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation[2020-12-01T13:07:36.749+0000][info ][safepoint] Entering safepoint region: PrintThreads[2020-12-01T13:07:37.036+0000][info ][safepoint] Entering safepoint region: PrintJNI[2020-12-01T13:07:37.052+0000][info ][safepoint] Entering safepoint region: FindDeadlocks[2020-12-01T13:07:37.184+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint[2020-12-01T13:07:37.258+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation[2020-12-01T13:07:37.301+0000][info ][safepoint] Entering safepoint region: PrintThreads[2020-12-01T13:07:37.624+0000][info ][safepoint] Entering safepoint region: PrintJNI[2020-12-01T13:07:37.648+0000][info ][safepoint] Entering safepoint region: FindDeadlocks[2020-12-01T13:07:37.906+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation[2020-12-01T13:07:37.955+0000][info ][safepoint] Entering safepoint region: PrintThreads[2020-12-01T13:07:38.248+0000][info ][safepoint] Entering safepoint region: PrintJNI[2020-12-01T13:07:38.265+0000][info ][safepoint] Entering safepoint region: FindDeadlocks[2020-12-01T13:07:38.431+0000][info ][safepoint] Entering safepoint region: G1CollectForAllocation[2020-12-01T13:07:38.551+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint[2020-12-01T13:07:38.620+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation[2020-12-01T13:07:38.686+0000][info ][safepoint] Entering safepoint region: PrintThreads[2020-12-01T13:07:38.978+0000][info ][safepoint] Entering safepoint region: PrintJNI[2020-12-01T13:07:39.001+0000][info ][safepoint] Entering safepoint region: FindDeadlocks[2020-12-01T13:07:39.214+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation[2020-12-01T13:07:39.269+0000][info ][safepoint] Entering safepoint region: PrintThreads[2020-12-01T13:07:39.552+0000][info ][safepoint] Entering safepoint region: PrintJNI[2020-12-01T13:07:39.568+0000][info ][safepoint] Entering safepoint region: FindDeadlocks[2020-12-01T13:07:39.733+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint[2020-12-01T13:07:39.813+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation[2020-12-01T13:07:39.860+0000][info ][safepoint] Entering safepoint region: PrintThreads
复制代码

看上去很像 JFR 的定时采集,通过 top -Hp 查看线程 cpu 占用:

top - 01:31:46 up 10:01,  1 user,  load average: 2.16, 2.55, 2.62Threads: 25390 total,   2 running, 25388 sleeping,   0 stopped,   0 zombie%Cpu(s): 17.0 us,  4.7 sy,  0.0 ni, 74.0 id,  4.1 wa,  0.0 hi,  0.2 si,  0.0 stKiB Mem : 32120452 total,   287936 free, 20438864 used, 11393652 buff/cacheKiB Swap:        0 total,        0 free,        0 used. 11485760 avail Mem 
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 63120 root 20 0 26.8g 9.4g 22016 R 73.2 30.8 178:50.29 VM Thread 63130 root 20 0 26.8g 9.4g 22016 S 15.7 30.8 45:19.81 JFR Periodic Ta 63114 root 20 0 26.8g 9.4g 22016 S 6.6 30.8 17:31.55 GC Thread#0 63185 root 20 0 26.8g 9.4g 22016 S 6.6 30.8 44:59.01 Log4j2-TF-10-As 63346 root 20 0 26.8g 9.4g 22016 S 6.3 30.8 17:31.89 GC Thread#2 63345 root 20 0 26.8g 9.4g 22016 S 6.0 30.8 17:31.70 GC Thread#1 63347 root 20 0 26.8g 9.4g 22016 S 5.7 30.8 17:31.86 GC Thread#3 63128 root 20 0 26.8g 9.4g 22016 S 5.4 30.8 13:35.61 JFR Recorder Th 63134 root 20 0 26.8g 9.4g 22016 S 4.2 30.8 13:21.16 java 63569 root 20 0 26.8g 9.4g 22016 S 1.8 30.8 19:20.94 lettuce-epollEv 64742 root 20 0 26.8g 9.4g 22016 S 1.2 30.8 1:03.26 XNIO-2 task-6 2777 root 20 0 26.8g 9.4g 22016 S 1.2 30.8 0:46.15 XNIO-2 task-24 。。。。。。
复制代码

发现使用 CPU 最多的线程是 VM 线程,其次是 JFR 线程。怀疑是因为 JFR 采集的原因造成的所有线程一直进入安全点的问题。我们的 JFR 采集事件的配置是默认的配置,也就是 JDK 目录下的 lib 目录下的 jfr 目录下的 default.jfc 查看其中的线程 Dump 配置:

<event name="jdk.ThreadDump">	<setting name="enabled" control="thread-dump-enabled">true</setting>	<setting name="period" control="thread-dump-interval">everyChunk</setting></event>
复制代码

发现默认是 everyChunk,也就是每次进入新的 Chunk 会打印一次。那么什么是 Chunk?啥时候进入新的 Chunk?



在 JFR 中,所有的 Event (包括通过 JFR API 产生的 Event 还有 JVM 产生的 EVENT),会先存储到每个线程自己的 Thread Buffer 中;在这个 Buffer 满了之后,会将 Buffer 的内容刷入 Global Buffer 中;Global Buffer 是一个环形 Buffer,保存着所有线程发送来的 Thread Buffer 中的内容。当这个环形 Buffer 存储到达上限之后,根据配置,会选择丢弃或者刷入文件,这里默认是刷入文件。这个文件,就可以理解为 Chunk。

Chunk的大小是有限制的,默认是12m.但是,这个进程的线程很多,导致线程 dump 超过了 12m。导致 JFR 一直切换新的 Chunk,然后又触发evenryChunk导致无限循环进入安全点。

解决方案

发现问题之后,首先想到的是,减少线程数量。由于业务需要,这个业务的线程主要用来做业务隔离以及有序执行,同一时间内的 Running 线程数量并不会很多。只有 Running 的线程才会进行 CPU 上下文切换,创建很多线程,但是同一时间有任务的线程很少(Running 线程比较少),那么线程切换也不会很频繁。等待任务的线程基本不占用 CPU 资源。所以暂时先不考虑减少线程数量。

最后的解决方案是,对于 JFR 默认事件采集配置,每个采集周期和 Chunk 相关的,都要谨慎处理,最好周期通过固定时间写死,例如每一分钟等等,不要使用 Chunk 作为指标,防止出现某个采集雪崩导致不断切换 chunk 导致无限采集这个问题。


看完三件事❤️



如果你觉得这篇内容对你还蛮有帮助,我想邀请你帮我三个小忙:



  1. 点赞,转发,有你们的 『点赞和评论』,才是我创造的动力。

  2. 关注公众号 『 java 烂猪皮 』,不定期分享原创知识。

  3. 同时可以期待后续文章 ing🚀

  4. .关注后回复【666】扫码即可获取学习资料包




作者:张哈希

链接:https://club.perfma.com/article/2068590


用户头像

AI乔治

关注

分享后端技术干货。公众号【 Java烂猪皮】 2019.06.30 加入

一名默默无闻的扫地僧!

评论

发布
暂无评论
JFR定位线上问题实例 - JFR导致的雪崩问题定位与解决