Lessons from troubleshooting a production ES issue: the circuit breaker mechanism

罗琦
Published: May 17, 2020

Overview of the production issue

Nodes frequently leaving the cluster

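Reproducing the problem: the production cluster already had shards in recovery, and a force merge was running at the same time. The two together put heavy pressure on memory and caused a CircuitBreakingException to be thrown.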

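The message roughly says that the transport request was too large, so actual memory usage would exceed the configured maximum. The code that raises this exception is in the ChildMemoryCircuitBreaker class.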

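According to its comment, that method hands the decision to the parent (here, parent refers to the parent-level breaker held by HierarchyCircuitBreakerService), which determines whether to trip the breaker. Tracing upward from there, the caller is the following method: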

/**
 * Add a number of bytes, tripping the circuit breaker if the aggregated
 * estimates are above the limit. Automatically trips the breaker if the
 * memory limit is set to 0. Will never trip the breaker if the limit is
 * set < 0, but can still be used to aggregate estimations.
 * @param bytes number of bytes to add to the breaker
 * @return number of "used" bytes so far
 */
@Override
public double addEstimateBytesAndMaybeBreak(long bytes, String label) throws CircuitBreakingException {
    // short-circuit on no data allowed, immediately throwing an exception
    if (memoryBytesLimit == 0) {
        circuitBreak(label, bytes);
    }
    long newUsed;
    // If there is no limit (-1), we can optimize a bit by using
    // .addAndGet() instead of looping (because we don't have to check a
    // limit), which makes the RamAccountingTermsEnum case faster.
    if (this.memoryBytesLimit == -1) {
        newUsed = noLimit(bytes, label);
    } else {
        newUsed = limit(bytes, label);
    }
    // Additionally, we need to check that we haven't exceeded the parent's limit
    try {
        parent.checkParentLimit((long) (bytes * overheadConstant), label);
    } catch (CircuitBreakingException e) {
        // If the parent breaker is tripped, this breaker has to be
        // adjusted back down because the allocation is "blocked" but the
        // breaker has already been incremented
        this.addWithoutBreaking(-bytes);
        throw e;
    }
    return newUsed;
}
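The reserve-then-roll-back pattern in this method can be sketched in a few self-contained classes. This is only an illustration under simplifying assumptions: SketchParentBreaker, SketchChildBreaker and SketchBreakerException are hypothetical names, the overhead factor is left out, and the parent simply sums what its children have reserved.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-ins for illustration only; these are not the Elasticsearch classes.
class SketchBreakerException extends RuntimeException {
    SketchBreakerException(String message) {
        super(message);
    }
}

class SketchParentBreaker {
    private final long limitBytes;
    private final List<SketchChildBreaker> children = new CopyOnWriteArrayList<>();

    SketchParentBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    void register(SketchChildBreaker child) {
        children.add(child);
    }

    // Sums what every child has reserved and throws if the total exceeds the parent limit.
    void checkLimit(String label) {
        long total = 0;
        for (SketchChildBreaker child : children) {
            total += child.getUsed();
        }
        if (total > limitBytes) {
            throw new SketchBreakerException("[parent] Data too large, data for [" + label
                + "] would be [" + total + "], which is larger than the limit of [" + limitBytes + "]");
        }
    }
}

class SketchChildBreaker {
    private final String name;
    private final long limitBytes; // 0 always trips, -1 means no limit
    private final SketchParentBreaker parent;
    private final AtomicLong used = new AtomicLong();

    SketchChildBreaker(String name, long limitBytes, SketchParentBreaker parent) {
        this.name = name;
        this.limitBytes = limitBytes;
        this.parent = parent;
        parent.register(this);
    }

    long getUsed() {
        return used.get();
    }

    long addEstimateBytesAndMaybeBreak(long bytes, String label) {
        if (limitBytes == 0) {
            throw new SketchBreakerException("[" + name + "] no data allowed for [" + label + "]");
        }
        long newUsed;
        if (limitBytes == -1) {
            // No own limit: a plain atomic add is enough.
            newUsed = used.addAndGet(bytes);
        } else {
            // Own limit: compare-and-set loop so the check and the update stay consistent.
            long current;
            do {
                current = used.get();
                newUsed = current + bytes;
                if (newUsed > limitBytes) {
                    throw new SketchBreakerException("[" + name + "] Data too large, data for [" + label
                        + "] would be [" + newUsed + "], which is larger than the limit of [" + limitBytes + "]");
                }
            } while (!used.compareAndSet(current, newUsed));
        }
        try {
            parent.checkLimit(label);
        } catch (SketchBreakerException e) {
            // The reservation was already counted, so undo it before rethrowing.
            used.addAndGet(-bytes);
            throw e;
        }
        return newUsed;
    }
}
```

With a parent limit of 100 and two children limited to 80 each, reserving 60 on the first child succeeds, while a following reservation of 50 on the second child trips the parent and leaves the second child's counter back at 0.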

In addEstimateBytesAndMaybeBreak, the line parent.checkParentLimit((long) (bytes * overheadConstant), label); shows that checkParentLimit is invoked on HierarchyCircuitBreakerService. That class also defines many of the circuit breaker settings, split into limits for fielddata, request, accounting, and in_flight_requests.

checkParentLimit is where the exception from our incident is thrown.

/**
 * Checks whether the parent breaker has been tripped
 */
public void checkParentLimit(long newBytesReserved, String label) throws CircuitBreakingException {
    final MemoryUsage memoryUsed = memoryUsed(newBytesReserved);
    long parentLimit = this.parentSettings.getLimit();
    if (memoryUsed.totalUsage > parentLimit) {
        this.parentTripCount.incrementAndGet();
        final StringBuilder message = new StringBuilder("[parent] Data too large, data for [" + label + "]" +
            " would be [" + memoryUsed.totalUsage + "/" + new ByteSizeValue(memoryUsed.totalUsage) + "]" +
            ", which is larger than the limit of [" +
            parentLimit + "/" + new ByteSizeValue(parentLimit) + "]");
        if (this.trackRealMemoryUsage) {
            final long realUsage = memoryUsed.baseUsage;
            message.append(", real usage: [");
            message.append(realUsage);
            message.append("/");
            message.append(new ByteSizeValue(realUsage));
            message.append("], new bytes reserved: [");
            message.append(newBytesReserved);
            message.append("/");
            message.append(new ByteSizeValue(newBytesReserved));
            message.append("]");
        }
        message.append(", usages [");
        message.append(String.join(", ",
            this.breakers.entrySet().stream().map(e -> {
                final CircuitBreaker breaker = e.getValue();
                final long breakerUsed = (long)(breaker.getUsed() * breaker.getOverhead());
                return e.getKey() + "=" + breakerUsed + "/" + new ByteSizeValue(breakerUsed);
            })
            .collect(Collectors.toList())));
        message.append("]");
        // derive durability of a tripped parent breaker depending on whether the majority of memory tracked by
        // child circuit breakers is categorized as transient or permanent.
        CircuitBreaker.Durability durability = memoryUsed.transientChildUsage >= memoryUsed.permanentChildUsage ?
            CircuitBreaker.Durability.TRANSIENT : CircuitBreaker.Durability.PERMANENT;
        logger.debug("{}", message);
        throw new CircuitBreakingException(message.toString(), memoryUsed.totalUsage, parentLimit, durability);
    }
}
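To make the decision logic of checkParentLimit easier to follow, here is a stripped-down sketch of it. The class ParentCheckSketch and its nested types are hypothetical. The way the total is computed is an assumption about what memoryUsed(newBytesReserved) does: with real-memory tracking the total is the current heap usage plus the new reservation, otherwise it is the sum of the child breakers' estimates.

```java
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical sketch of a parent-level check; names and structure are assumptions.
class ParentCheckSketch {
    enum Durability { TRANSIENT, PERMANENT }

    static final class ChildUsage {
        final long usedBytes;
        final double overhead;
        final Durability durability;

        ChildUsage(long usedBytes, double overhead, Durability durability) {
            this.usedBytes = usedBytes;
            this.overhead = overhead;
            this.durability = durability;
        }

        // Same shape as breaker.getUsed() * breaker.getOverhead() in the method above.
        long accounted() {
            return (long) (usedBytes * overhead);
        }
    }

    static final class Decision {
        final boolean tripped;
        final long totalUsage;
        final Durability durability;

        Decision(boolean tripped, long totalUsage, Durability durability) {
            this.tripped = tripped;
            this.totalUsage = totalUsage;
            this.durability = durability;
        }
    }

    static Decision check(Map<String, ChildUsage> children, long parentLimit, long newBytesReserved,
                          boolean trackRealMemory, LongSupplier heapUsedBytes) {
        long transientUsage = 0;
        long permanentUsage = 0;
        for (ChildUsage child : children.values()) {
            if (child.durability == Durability.TRANSIENT) {
                transientUsage += child.accounted();
            } else {
                permanentUsage += child.accounted();
            }
        }
        // Assumed: real-memory mode looks at the heap, estimate mode looks at the children.
        long total = trackRealMemory
            ? heapUsedBytes.getAsLong() + newBytesReserved
            : transientUsage + permanentUsage;
        // A trip counts as transient when most tracked memory belongs to transient children.
        Durability durability = transientUsage >= permanentUsage
            ? Durability.TRANSIENT : Durability.PERMANENT;
        return new Decision(total > parentLimit, total, durability);
    }
}
```

The point to notice is that in real-memory mode the children's estimates do not decide whether the breaker trips. The heap reading does, so garbage that has not been collected yet counts against the limit.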

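As for why nodes kept leaving the cluster: ES keeps running heartbeat checks against nodes whose requests continue to fail. If many requests repeatedly throw exceptions, the node is marked as dead and removed from the master's node list. The real master then can no longer detect the node and takes it out of the cluster for good. The node rejoins only after its memory load has dropped again.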

Background

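Because of the CircuitBreaker real-memory limit, recoveries could not proceed normally and nodes kept leaving the cluster. As a temporary fix we set indices.breaker.total.use_real_memory to false and restarted the cluster. The official release notes show that this setting arrived with 7.0: https://www.elastic.co/guide/en/elasticsearch/reference/7.0/breaking-changes-7.0.html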

A detailed explanation of the problem:

https://discuss.elastic.co/t/what-does-this-error-mean-data-too-large-data-for-transport-request/209345

The following GitHub issues and pull requests show the explanations and the development history around the circuit breaker:

https://github.com/elastic/elasticsearch/issues/44484

https://github.com/elastic/elasticsearch/pull/55566

https://github.com/elastic/elasticsearch/issues/56327

https://github.com/elastic/elasticsearch/pull/55353

Root cause analysis

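The main cause is that indices.breaker.total.use_real_memory defaults to true. This setting does not exist in versions before 7.x. Because of how G1GC behaves, heap usage climbs and often ends up above the breaker's limit, which trips the breaker and causes the node to disconnect. In addition, memory that ES grows automatically can be held for a long time and adds to the usage.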
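The effect of real-memory tracking can be seen with a small sketch that reads the JVM heap through the standard MemoryMXBean and compares it with a limit. RealMemoryCheckSketch and its method names are hypothetical, and the limit fraction is a parameter you supply (the 7.0 documentation gives 95% of the heap as the default for indices.breaker.total.limit when real-memory tracking is on).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical sketch: compares heap usage as reported by the JVM with a fractional limit.
class RealMemoryCheckSketch {

    // Heap currently in use, including garbage that has not been collected yet.
    static long heapUsedBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    // Maximum heap size, or -1 if the JVM does not define one.
    static long heapMaxBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getMax();
    }

    // True when the heap reading plus the new reservation would exceed limitFraction of the heap.
    static boolean wouldTrip(long heapUsed, long newBytesReserved, long heapMax, double limitFraction) {
        long limit = (long) (heapMax * limitFraction);
        return heapUsed + newBytesReserved > limit;
    }

    // Convenience wrapper that takes both readings from the running JVM.
    static boolean wouldTripNow(long newBytesReserved, double limitFraction) {
        long max = heapMaxBytes();
        if (max <= 0) {
            return false; // no defined maximum, nothing to compare against
        }
        return wouldTrip(heapUsedBytes(), newBytesReserved, max, limitFraction);
    }
}
```

Because getUsed() includes objects that are already dead but not yet collected, a collector that lets the heap fill up before cleaning can push this reading over the limit even when live data is small. That is why the G1 tuning discussed next aims to start collection earlier.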

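The more thorough fix is to turn off real-memory tracking. If you keep it on, you need to tune the G1GC parameters instead (this may not fix the problem completely, and some cases will still report the error):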

10-:-XX:G1ReservePercent=25
10-:-XX:InitiatingHeapOccupancyPercent=30

Release 7.6.2 already ships with these two lines, although they are only strictly required on JDK 14 and above.

References:

https://discuss.elastic.co/t/circuitbreakingexception-parent-data-too-large-in-es-7-x/192801/7

https://github.com/elastic/elasticsearch/pull/46169

Approach to a fix

We have already done some enhancements for high-pressure search/write caused nodes overload, if the overloaded nodes' parent breaker has been triggered, we clean related memory and fail the request as soon as possible, this would make nodes always could be garbage cleaned instead of get stuck and disconnect from cluster finally. 

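In other words, some improvements have already been made for search or write requests that overload a node: once the parent breaker of an overloaded node trips, the related memory is released and the request fails fast. This lets the node garbage-collect instead of hanging or dropping out of the cluster.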

The errors reported by recoveries can be tackled from two angles: cancelling the request or retrying it.
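The retry angle can be illustrated with a small helper that retries only when a breaker trip is marked as transient, echoing the TRANSIENT and PERMANENT durability seen in checkParentLimit. This is a hypothetical sketch, not the change made in Elasticsearch: TransientRetrySketch, BreakerTripped and the backoff policy are assumptions for illustration.

```java
import java.util.function.Supplier;

// Hypothetical retry helper; names and backoff policy are assumptions.
class TransientRetrySketch {
    enum Durability { TRANSIENT, PERMANENT }

    static final class BreakerTripped extends RuntimeException {
        final Durability durability;

        BreakerTripped(String message, Durability durability) {
            super(message);
            this.durability = durability;
        }
    }

    // Runs the action, retrying with doubling backoff while the failure is transient.
    static <T> T callWithRetry(Supplier<T> action, int maxAttempts, long initialBackoffMillis) {
        long backoff = initialBackoffMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.get();
            } catch (BreakerTripped e) {
                // Permanent trips will not clear on their own, so give up immediately.
                if (e.durability == Durability.PERMANENT || attempt >= maxAttempts) {
                    throw e;
                }
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", interrupted);
                }
                backoff *= 2;
            }
        }
    }
}
```

Retrying gives the node time to collect garbage before the request is sent again, whereas a permanent trip is surfaced to the caller right away.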

The code change submitted by a community member can be seen in:

server/src/main/java/org/elasticsearch/tasks/TaskManager.java

First, check the pending tasks on the channel.

Then, after an atomic operation, put the pending task into the tracker.

Finally, unregister the event.

The approach is clear.
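The three steps can be sketched roughly as follows. This is a hypothetical reconstruction from the description, not the code in TaskManager.java: ChannelTaskTrackingSketch, Channel, InMemoryChannel and the use of plain string task ids are assumptions. It also ignores the race where a task starts being tracked after the channel has already closed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of per-channel tracking of pending tasks.
class ChannelTaskTrackingSketch {
    interface Channel {
        void addCloseListener(Runnable listener);
    }

    // Minimal channel used to drive the sketch; a real channel would wrap a network connection.
    static final class InMemoryChannel implements Channel {
        private final List<Runnable> closeListeners = new ArrayList<>();

        @Override
        public void addCloseListener(Runnable listener) {
            closeListeners.add(listener);
        }

        int listenerCount() {
            return closeListeners.size();
        }

        void close() {
            for (Runnable listener : closeListeners) {
                listener.run();
            }
        }
    }

    private static final class Tracker {
        final AtomicBoolean registered = new AtomicBoolean();
        final Set<String> pending = ConcurrentHashMap.newKeySet();
    }

    private final ConcurrentHashMap<Channel, Tracker> trackers = new ConcurrentHashMap<>();
    private final Set<String> cancelled = ConcurrentHashMap.newKeySet();

    // Returns a handle that stops tracking the task once it has finished normally.
    Runnable startTracking(Channel channel, String taskId) {
        // Steps 1 and 2: look up the channel's tracker and add the task atomically.
        Tracker tracker = trackers.compute(channel, (key, current) -> {
            if (current == null) {
                current = new Tracker();
            }
            current.pending.add(taskId);
            return current;
        });
        // Register the close listener only once per channel.
        if (tracker.registered.compareAndSet(false, true)) {
            channel.addCloseListener(() -> {
                trackers.remove(channel);
                cancelled.addAll(tracker.pending);
                tracker.pending.clear();
            });
        }
        // Step 3: the caller runs this to remove the task from tracking.
        return () -> tracker.pending.remove(taskId);
    }

    Set<String> cancelledTasks() {
        return cancelled;
    }

    int trackedChannels() {
        return trackers.size();
    }
}
```

When the channel closes, every task still pending on it is cancelled, so work for a caller that has gone away no longer holds memory on the node.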

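Members on the forum also suggested that if memory pressure stays too high, you can consider splitting the cluster. This is a common and effective way to relieve an overloaded production cluster.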

Further directions

  1. Improving CircuitBreakerService itself: https://github.com/elastic/elasticsearch/pull/55695

  2. Whether ZGC could solve this problem, along with its strengths and weaknesses (covered in detail in the next post)


