首发于JFR全解

深度探索JFR - JFR定位线上问题实例 - JFR导致的雪崩问题定位与解决

最近发现一个应用,使用默认的 JFR 配置,发生了类似于雪崩的现象,这里记录下,引以为戒。

Key takeaways

  1. JFR 的线程堆栈 dump 采集默认周期是everyChunk,也就是每次新建一个Chunk就会采集一次。默认每个Chunk大小为 12M,在线程比较多的时候,堆栈 dump 可能大于 12M,导致 JFR 一直切换新的 Chunk,然后又触发evenryChunk导致无限循环而雪崩。
  2. 对于 JFR 默认事件采集配置(位于JDK目录/lib/jfr/default.jfc),每个采集周期和Chunk相关的,都要谨慎处理,最好周期通过固定时间写死,例如每一分钟等等,不要使用Chunk作为指标,防止上面这个问题。
  3. 只有 Running 的线程才会进行 CPU 上下文切换,创建很多线程,但是同一时间有任务的线程很少(Running 线程比较少),那么线程切换也不会很频繁。等待任务的线程基本不占用 CPU 资源。

背景介绍

线上某个线程数非常多的应用,线程是空闲状态,有任务提交过来,但是过了 10s,任务才开始执行,并且这一段时间有很多类似的情况,例如:

2020-12-01 13:07:34.515  INFO [微服务名称,298ee1f359f67ff7,298ee1f359f67ff7] [19] [线程全局唯一名称:调度线程] 任务提交到线程池

.....

2020-12-01 13:07:43.623  INFO [微服务名称,298ee1f359f67ff7,e881ba0d86fbc4fc] [19] [线程全局唯一名称:24684780-0] 开始处理
2020-12-01 13:07:44.159  INFO [微服务名称,298ee1f359f67ff7,e881ba0d86fbc4fc] [19] [线程全局唯一名称:24684780-0] 处理完成

查看24684780-0这个线程池的统计当时的队列是空的,所有线程都是等待任务的状态。排除了有其他任务占满核心线程池导致等待的情况。

线程一直空闲,但是却不能接受任务处理,表示这个线程可能一直处于需要进入安全点SafePoint的场景。查看GC,当时的 GC 是正常的,证明并不是 GC 引起的需要进入安全点



查看当时的 safepoint 日志(我们线上的进程开启了 safepoint 日志,参数:-Xlog:safepoint=debug:file=safepoint.log:utctime,level,tags:filecount=50,filesize=100M),发现在这期间内,有很多很多的进入 safepoint 的处理(我们只看了进入安全点(Entering safepoint region)这一行):

[2020-12-01T13:07:32.290+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:32.335+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:32.635+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:32.655+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:32.853+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:32.923+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:32.968+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:33.269+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:33.286+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:33.480+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:33.524+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:33.806+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:33.822+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:33.949+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:34.019+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:34.064+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:34.346+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:34.361+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:34.510+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:34.555+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:34.839+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:34.861+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:35.023+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:35.097+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:35.140+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:35.423+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:35.439+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:35.614+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:35.654+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:35.943+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:35.959+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:36.089+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:36.172+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:36.216+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:36.498+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:36.515+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:36.706+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:36.749+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:37.036+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:37.052+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:37.184+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:37.258+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:37.301+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:37.624+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:37.648+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:37.906+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:37.955+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:38.248+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:38.265+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:38.431+0000][info ][safepoint] Entering safepoint region: G1CollectForAllocation
[2020-12-01T13:07:38.551+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:38.620+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:38.686+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:38.978+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:39.001+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:39.214+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:39.269+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:39.552+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:39.568+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:39.733+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:39.813+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:39.860+0000][info ][safepoint] Entering safepoint region: PrintThreads

看上去很像 JFR 的定时采集,通过 top -Hp查看线程cpu占用:

top - 01:31:46 up 10:01,  1 user,  load average: 2.16, 2.55, 2.62
Threads: 25390 total,   2 running, 25388 sleeping,   0 stopped,   0 zombie
%Cpu(s): 17.0 us,  4.7 sy,  0.0 ni, 74.0 id,  4.1 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 32120452 total,   287936 free, 20438864 used, 11393652 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 11485760 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                                                                                      
63120 root      20   0   26.8g   9.4g  22016 R 73.2 30.8 178:50.29 VM Thread                                                                                                                                                                    
63130 root      20   0   26.8g   9.4g  22016 S 15.7 30.8  45:19.81 JFR Periodic Ta                                                                                                                                                              
63114 root      20   0   26.8g   9.4g  22016 S  6.6 30.8  17:31.55 GC Thread#0                                                                                                                                                                  
63185 root      20   0   26.8g   9.4g  22016 S  6.6 30.8  44:59.01 Log4j2-TF-10-As                                                                                                                                                              
63346 root      20   0   26.8g   9.4g  22016 S  6.3 30.8  17:31.89 GC Thread#2                                                                                                                                                                  
63345 root      20   0   26.8g   9.4g  22016 S  6.0 30.8  17:31.70 GC Thread#1                                                                                                                                                                  
63347 root      20   0   26.8g   9.4g  22016 S  5.7 30.8  17:31.86 GC Thread#3                                                                                                                                                                  
63128 root      20   0   26.8g   9.4g  22016 S  5.4 30.8  13:35.61 JFR Recorder Th                                                                                                                                                              
63134 root      20   0   26.8g   9.4g  22016 S  4.2 30.8  13:21.16 java                                                                                                                                                                         
63569 root      20   0   26.8g   9.4g  22016 S  1.8 30.8  19:20.94 lettuce-epollEv                                                                                                                                                              
64742 root      20   0   26.8g   9.4g  22016 S  1.2 30.8   1:03.26 XNIO-2 task-6                                                                                                                                                                
 2777 root      20   0   26.8g   9.4g  22016 S  1.2 30.8   0:46.15 XNIO-2 task-24    
 。。。。。。

发现使用 CPU 最多的线程是 VM 线程,其次是 JFR 线程。怀疑是因为 JFR 采集的原因造成的所有线程一直进入安全点的问题。我们的 JFR 采集事件的配置是默认的配置,也就是 JDK 目录下的 lib 目录下的 jfr 目录下的 default.jfc 查看其中的线程 Dump 配置:

<event name="jdk.ThreadDump">
	<setting name="enabled" control="thread-dump-enabled">true</setting>
	<setting name="period" control="thread-dump-interval">everyChunk</setting>
</event>

发现默认是everyChunk,也就是每次进入新的Chunk会打印一次。那么什么是Chunk?啥时候进入新的Chunk



在 JFR 中,所有的 Event (包括通过JFR API产生的 Event 还有 JVM 产生的 EVENT),会先存储到每个线程自己的 Thread Buffer 中;在这个 Buffer 满了之后,会将 Buffer 的内容刷入 Global Buffer 中;Global Buffer 是一个环形 Buffer,保存着所有线程发送来的 Thread Buffer 中的内容。当这个环形 Buffer 存储到达上限之后,根据配置,会选择丢弃或者刷入文件,这里默认是刷入文件。这个文件,就可以理解为Chunk

Chunk的大小是有限制的,默认是12m.但是,**这个进程的线程很多,导致线程 dump 超过了 12m。**导致 JFR 一直切换新的 Chunk,然后又触发evenryChunk导致无限循环进入安全点。

解决方案

发现问题之后,首先想到的是,减少线程数量。由于业务需要,这个业务的线程主要用来做业务隔离以及有序执行,同一时间内的 Running 线程数量并不会很多。只有 Running 的线程才会进行 CPU 上下文切换,创建很多线程,但是同一时间有任务的线程很少(Running 线程比较少),那么线程切换也不会很频繁。等待任务的线程基本不占用 CPU 资源。所以暂时先不考虑减少线程数量。

最后的解决方案是,对于 JFR 默认事件采集配置,每个采集周期和Chunk相关的,都要谨慎处理,最好周期通过固定时间写死,例如每一分钟等等,不要使用Chunk作为指标,防止出现某个采集雪崩导致不断切换chunk导致无限采集这个问题。

发布于 2020-12-02 18:42