从 Spring 框架中的一个 BUG 来分析锁使用的问题
锁的使用是一个非常细致的事情,如果使用不当会导致系统性能急剧下降,下面我们通过分析 Spring 框架中一个锁使用错误的一个 BUG 来总结一下 Lock 这部分的知识。 关于 BUG 的细节,可以 Google " MimeTypeUtils LRU cache " 关键词找到。
本文将关键的代码摘出。
测试有问题版本的 ConcurrentLruCache
注意: metrics 属性及其逻辑,是为了收集指标加入的,原本 Spring 框架代码中没有这段
/**
* 并发的Lur Cache
* @param <K>
* @param <V>
*/
private static class ConcurrentLruCache<K, V> implements LruCache<K, V>{
private final int maxSize;
private final ConcurrentLinkedQueue<K> queue = new ConcurrentLinkedQueue<>();
private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
private final ReadWriteLock lock = new ReentrantReadWriteLock();
private final Function<K, V> generator;
private final List<Long> metrics;
public ConcurrentLruCache(int maxSize, Function<K, V> generator , List<Long> metrics) {
Assert.isTrue(maxSize > 0, "LRU max size should be positive");
Assert.notNull(generator, "Generator function should not be null");
this.maxSize = maxSize;
this.generator = generator;
this.metrics = metrics;
}
@Override
public V get(K key) {
this.lock.readLock().lock();
try {
if (this.queue.size() < this.maxSize / 2) {
V cached = this.cache.get(key);
if (cached != null) {
return cached;
}
}
else if (this.queue.remove(key)) {
this.queue.add(key);
return this.cache.get(key);
}
}
finally {
this.lock.readLock().unlock();
}
this.lock.writeLock().lock();
long s = System.currentTimeMillis();
try {
// retrying in case of concurrent reads on the same key
if (this.queue.remove(key)) {
this.queue.add(key);
return this.cache.get(key);
}
if (this.queue.size() == this.maxSize) {
K leastUsed = this.queue.poll();
if (leastUsed != null) {
this.cache.remove(leastUsed);
}
}
V value = this.generator.apply(key);
this.queue.add(key);
this.cache.put(key, value);
return value;
}
finally {
metrics.add(System.currentTimeMillis()-s);
this.lock.writeLock().unlock();
}
}
}
我们通过 benchmarkLruCache 的程序进行测试,这个程序也会测试修改后的版本。
/**
*
* @param loop 循环次数
* @param sample 模拟的VALUE不同种类数
* @param lruCache 测试的Lru实现
* @return
*/
private static List<Long> benchmarkLruCache(int loop , int sample , LruCache<String,Object> lruCache){
List<Long> times = Collections.synchronizedList(new ArrayList<>(loop));
CountDownLatch countDownLatch = new CountDownLatch(loop);
ExecutorService executor = Executors.newFixedThreadPool(100);
IntStream.rangeClosed(1,loop)
.forEach(i ->{
executor.execute(() -> {
long s = System.currentTimeMillis();
IntStream.rangeClosed(1,sample)
.forEach(n -> {
String v = n+"";
if(!lruCache.get(v).equals(v .hashCode())){
throw new RuntimeException("结果错误");
}
});
times.add((System.currentTimeMillis() - s));
countDownLatch.countDown();
});
});
executor.shutdown();
try {
countDownLatch.await();
}catch (InterruptedException e){}
return times;
}
构造 concurrentLruCache 进行测试
int loop = 10000;
int maxSize = 64;
int sample = 65;
List<Long> metrics1 = Collections.synchronizedList(new ArrayList<>(loop * sample));
LruCache<String,Object> concurrentLruCache = new ConcurrentLruCache<String, Object>(maxSize, v -> v.hashCode(),metrics1);
System.out.println("有问题的 ConcurrentLruCache 版本总耗时 : "+benchmarkLruCache(loop,sample,concurrentLruCache).stream()
// .peek(System.out::println)
.mapToLong(v -> v)
.sum());
System.out.println("有问题的 ConcurrentLruCache 写次数与耗时 : "+metrics1.stream().mapToLong(v -> v).summaryStatistics());
测试结果
有问题的 ConcurrentLruCache 版本总耗时 : LongSummaryStatistics{count=650000, sum=127519404, min=0, average=196.183698, max=1222}
有问题的 ConcurrentLruCache 写次数与耗时 : LongSummaryStatistics{count=267141, sum=20882, min=0, average=0.078168, max=2}
测试优化后版本的 ConcurrentLruCache
private static class OptimizeConcurrentLruCache<K, V> implements LruCache<K, V>{
private final int maxSize;
private final ConcurrentLinkedDeque<K> queue = new ConcurrentLinkedDeque<>();
private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
private final ReadWriteLock lock;
private final Function<K, V> generator;
private volatile int size = 0;
private final List<Long> metrics;
public OptimizeConcurrentLruCache(int maxSize, Function<K, V> generator ,List<Long> metrics ) {
Assert.isTrue(maxSize > 0, "LRU max size should be positive");
Assert.notNull(generator, "Generator function should not be null");
this.maxSize = maxSize;
this.generator = generator;
this.lock = new ReentrantReadWriteLock();
this.metrics = metrics;
}
@Override
public V get(K key) {
V cached = this.cache.get(key);
if (cached != null) {
if (this.size < this.maxSize) {
return cached;
}
this.lock.readLock().lock();
try {
if (this.queue.removeLastOccurrence(key)) {
this.queue.offer(key);
}
return cached;
}
finally {
this.lock.readLock().unlock();
}
}
this.lock.writeLock().lock();
long s = System.currentTimeMillis();
try {
// Retrying in case of concurrent reads on the same key
cached = this.cache.get(key);
if (cached != null) {
if (this.queue.removeLastOccurrence(key)) {
this.queue.offer(key);
}
return cached;
}
// Generate value first, to prevent size inconsistency
V value = this.generator.apply(key);
int cacheSize = this.size;
if (cacheSize == this.maxSize) {
K leastUsed = this.queue.poll();
if (leastUsed != null) {
this.cache.remove(leastUsed);
cacheSize--;
}
}
this.queue.offer(key);
this.cache.put(key, value);
this.size = cacheSize + 1;
return value;
}
finally {
metrics.add(System.currentTimeMillis() - s);
this.lock.writeLock().unlock();
}
}
}
构造 concurrentLruCache 进行测试,测试参数保持一致
List<Long> metrics2 = Collections.synchronizedList(new ArrayList<>(loop * sample));
LruCache<String,Object> optimizeConcurrentLruCache = new OptimizeConcurrentLruCache<String, Object>(maxSize, v -> v.hashCode(),metrics2);
System.out.println("优化后的 ConcurrentLruCache 版本总耗时 : "+benchmarkLruCache(loop,sample,optimizeConcurrentLruCache).stream()
// .peek(System.out::println)
.mapToLong(v -> v)
.summaryStatistics());
System.out.println("优化后的 ConcurrentLruCache 写次数与耗时 : "+metrics2.stream().mapToLong(v -> v).summaryStatistics());
测试结果
优化后的 ConcurrentLruCache 版本总耗时 : LongSummaryStatistics{count=650000, sum=1881120, min=0, average=2.894031, max=62}
优化后的 ConcurrentLruCache 写次数与耗时 : LongSummaryStatistics{count=56995, sum=44, min=0, average=0.000772, max=19}
原因分析
我们通过 jstack 打印出线程堆栈
"pool-1-thread-100" #109 prio=5 os_prio=31 tid=0x00007fdc0708c800 nid=0xe403 waiting on condition [0x000070000a660000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
// .... 省略
at com.izhengyin.demo.concurrent.part6.LruCacheIssueTest$ConcurrentLruCache.get(LruCacheIssueTest.java:133)
// .... 省略
"pool-1-thread-97" #106 prio=5 os_prio=31 tid=0x00007fdc07817800 nid=0xe003 waiting on condition [0x000070000a357000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
// .... 省略
at com.izhengyin.demo.concurrent.part6.LruCacheIssueTest$ConcurrentLruCache.get(LruCacheIssueTest.java:114)
// .... 省略
通过堆栈信息,可以直观的看到,线程都在 WAITING 状态,分别是在 LruCacheIssueTest.java 133 行与 114 行,查看源码,这两行分别对应的写锁的获取与读锁的获取
类: ConcurrentLruCache
public V get(K key) {
... 省略
this.lock.readLock().lock(); 114 行
... 省略
this.lock.writeLock().lock(); 113 行
}
分析源码,按照我们给定的参数 (maxSize < sample)
get 方法每次执行时都会先获取读锁 , 如果此时写锁为释放那么就会阻塞
按照我们的样品数,分析读锁代码块里的代码,这段代码有三个问题 (看注释)
2.1 操作 queue 未加写锁,因此 queue 会被不断的 Remove / add , 同时也导致了很多 this.queue.remove(key) 是不会命中的,这样就讲流程下放到了写操作,这种情况线程数越多竞争的越厉害,情况越会加剧
2.2 this.queue 的 remove 是从头往后找,而 add 是网队尾添加,所以大部分情况都会进行 N 次扫描,效率不高
2.3 ConcurrentLinkedQueue remove ,存在内存泄漏问题。据描述是 8u60 版本引入的,我下下载了 8u101 与 8u102 的源代码,确定问题在 8u102 被修复,也就是说 java8 的部分版本也收到了影响。
参考官方的 BUG 描述 https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8137184
java8 历史版本对照列表 https://www.oracle.com/java/technologies/javase/javase8-archive-downloads.html
//这段代码不会被满足 ,因为 queue "一直" 是满的
if (this.queue.size() < this.maxSize / 2) {
V cached = this.cache.get(key);
if (cached != null) {
return cached;
}
}
//这段代码有二个问题
// 1. 此次操作queue未加写锁,因此 queue 会被不断的Remove / add , 同时也导致了很多 this.queue.remove(key) 是不会命中的,这样就讲流程下放到了写操作,这种情况线程数越多竞争的越厉害,情况越会加剧
// 2. this.queue 的 remove 是从头往后找,而add是网队尾添加,所以大部分情况都会进行N次扫描,效率不高
else if (this.queue.remove(key)) {
this.queue.add(key);
return this.cache.get(key);
}
写操作上逻辑上没有太大问题,主要是大量的写锁或阻塞读,读锁也会反过来阻塞写锁的获取。所以问题代码主要还是读操作代码段的问题
优化后的版本分析
采用了 ConcurrentLinkedDeque 替换了 ConcurrentLinkedQueue , 通过 removeLastOccurrence 从后往前删除,减少了删除操作的扫码次数。
读操作,先不加锁,只要能读到,就返回,极大的减少写流程的次数,这也就减少了获取锁的等待
V cached = this.cache.get(key);
if (cached != null) {
//这个条件不会成立,但影响不大了
if (this.size < this.maxSize) {
return cached;
}
this.lock.readLock().lock();
try {
// 采用了 ConcurrentLinkedDeque 替换了 ConcurrentLinkedQueue , 通过 removeLastOccurrence 从后往前删除,减少了删除操作的扫码次数。
if (this.queue.removeLastOccurrence(key)) {
this.queue.offer(key);
}
// 只要能读到就返回
return cached;
}
finally {
this.lock.readLock().unlock();
}
}
我们将两次的结果放在一起做一个直观对比,可以很直观的看到优化前后的差异。
有问题的 ConcurrentLruCache 版本总耗时 : LongSummaryStatistics{count=650000, sum=127519404, min=0, average=196.183698, max=1222}
有问题的 ConcurrentLruCache 写次数与耗时 : LongSummaryStatistics{count=267141, sum=20882, min=0, average=0.078168, max=2}
优化后的 ConcurrentLruCache 版本总耗时 : LongSummaryStatistics{count=650000, sum=1881120, min=0, average=2.894031, max=62}
优化后的 ConcurrentLruCache 写次数与耗时 : LongSummaryStatistics{count=56995, sum=44, min=0, average=0.000772, max=19}
版权声明: 本文为 InfoQ 作者【郑印】的原创文章。
原文链接:【http://xie.infoq.cn/article/52ca9504e3f9e038fe61e047d】。
本文遵守【CC BY-NC】协议,转载请保留原文出处及本版权声明。
郑印
还未添加个人签名 2017.10.17 加入
还未添加个人简介
评论