Elasticsearch Near Real-Time Search
Notes on Elasticsearch near real-time search (refresh). The content comes from the core-knowledge part of the Elasticsearch expert course series by 中华石杉 on Bilibili; the English excerpts come from Elasticsearch: The Definitive Guide [2.x].
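Optimizing the write flow to achieve NRT (near real-time) search
The problem with the existing flow is that a segment can only be opened for search after fsync has flushed it to disk. As a result, the time between writing a document and being able to search it may exceed 1 minute, which is not near real-time search. The main bottleneck is fsync: the disk I/O needed to write the data to disk is slow.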
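The original flow:
1. A document is written to the in-memory buffer
2. At regular intervals, the data in the buffer is written to a new index segment
3. The data in the index segment is first written to the OS cache
4. The index segment in the OS cache is fsynced to disk, and only then is the index segment opened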
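This is where the flow can be improved. ES does not wait for fsync to flush the data from the OS cache to disk before opening the index segment for search. Instead, as soon as the index segment's data reaches the OS cache, the segment is opened for search.
By default, the buffer is refreshed into a new index segment every second, so a new index segment file is produced every second.
The write flow is improved as follows: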
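1. Data is written to the buffer
2. At regular intervals, the data in the buffer is written to a segment file, which goes to the OS cache first
3. As soon as the segment is in the OS cache, it is opened for search; the commit is not executed immediately
The process of writing data to the OS cache and opening it for search is called refresh. By default, a refresh runs once every second: each second, the data in the buffer is written to a new index segment file, which lands in the OS cache first. This is why ES is near real-time: by default, a document becomes searchable about 1 second after it is written.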
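The refresh flow described above can be sketched as a toy model. This is only an illustration under stated assumptions: the class `ToyShard` and its methods are hypothetical names, segments are plain tuples, and the "OS cache" and "disk" are just Python lists. It is not how Elasticsearch or Lucene is implemented.

```python
class ToyShard:
    """Toy model of one shard: buffer -> searchable segments -> committed segments."""

    def __init__(self):
        self.buffer = []              # indexed documents, not yet searchable
        self.cached_segments = []     # segments in the "OS cache": searchable, not fsynced
        self.committed_segments = []  # segments "on disk": searchable and durable

    def index(self, doc):
        # A new document only goes into the in-memory buffer.
        self.buffer.append(doc)

    def refresh(self):
        # Cheap step: write the buffer to a new segment in the cache and open it.
        if self.buffer:
            self.cached_segments.append(tuple(self.buffer))
            self.buffer = []

    def commit(self):
        # Expensive step (stands in for fsync): make cached segments durable.
        # Simplifying assumption: a commit also writes out whatever is buffered.
        self.refresh()
        self.committed_segments.extend(self.cached_segments)
        self.cached_segments = []

    def search(self, term):
        # Search sees every opened segment, committed or not, but never the buffer.
        segments = self.committed_segments + self.cached_segments
        return [doc for seg in segments for doc in seg if term in doc]


shard = ToyShard()
shard.index("near real-time search")
before_refresh = shard.search("search")  # the buffer is not searchable yet
shard.refresh()
after_refresh = shard.search("search")   # searchable although nothing was committed
```

The point of the sketch is the ordering: `search` starts returning the document right after `refresh`, while `committed_segments` stays empty until `commit` runs.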
You can also trigger a refresh manually
Normally there is no need to do this by hand; just let ES handle it
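For reference, a manual refresh is a `POST` to the `_refresh` endpoint, either for all indices or for a single index. The sketch below only builds the HTTP method and path and sends nothing; the helper name `refresh_request` and the index name `blogs` are illustrative assumptions.

```python
def refresh_request(index=None):
    """Return (method, path) for a refresh call; hypothetical helper, sends nothing."""
    # No index name: refresh all indices. With a name: refresh only that index.
    path = "/_refresh" if index is None else f"/{index}/_refresh"
    return ("POST", path)


all_indices = refresh_request()        # refresh every index
one_index = refresh_request("blogs")   # refresh only the (example) index "blogs"
```

The resulting method and path can be passed to whatever HTTP client is already in use.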
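For example, if our freshness requirement is low and it is acceptable for a document to become searchable only one minute after it is written to ES, we can increase the refresh interval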
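A minimal sketch of such an adjustment, assuming an existing index called `my_logs` (an example name) and a one-minute interval. It only builds the request for the index settings endpoint; `refresh_interval_request` is a hypothetical helper, and the request still has to be sent with an HTTP client of your choice.

```python
import json


def refresh_interval_request(index, interval):
    """Build (method, path, body) to change refresh_interval; hypothetical helper."""
    path = f"/{index}/_settings"
    # The interval carries an explicit unit, e.g. "60s".
    body = json.dumps({"index": {"refresh_interval": interval}})
    return ("PUT", path, body)


method, path, body = refresh_interval_request("my_logs", "60s")
```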
So where does the commit operation come in? That is covered in the next installment.
Near Real-Time Search
Committing a new segment to disk requires an fsync to ensure that the segment is physically written to disk and that data will not be lost if there is a power failure. But an fsync is costly, ...
A Lucene index with new documents in the in-memory buffer
Documents in the in-memory indexing buffer are written to a new segment. The new segment is written to the filesystem cache first - which is cheap - and only later is it flushed to disk - which is expensive. But once a file is in the cache, it can be opened and read, just like any other file.
Lucene allows new segments to be written and opened - making the documents they contain visible to search - without performing a full commit.
The buffer contents have been written to a segment, which is searchable, but is not yet committed
refresh API
In Elasticsearch, this lightweight process of writing and opening a new segment is called a refresh. By default, every shard is refreshed automatically once every second. This is why we say that Elasticsearch has near real-time search: document changes are not visible to search immediately, but will become visible within 1 second.
While a refresh is much lighter than a commit, it still has a performance cost.
Not all use cases require a refresh every second... You can reduce the frequency of refreshes on a per-index basis by setting the refresh_interval:
The refresh_interval can be updated dynamically on an existing index.
The refresh_interval expects a duration such as 1s (1 second) or 2m (2 minutes). An absolute number like 1 means 1 millisecond - a sure way to bring your cluster to its knees.
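To make the unit pitfall concrete, here is a small sketch that converts a `refresh_interval` value to milliseconds using the rule quoted above from the 2.x guide (a bare number is read as milliseconds). The function `interval_to_millis` and the unit table are assumptions for illustration, not Elasticsearch's own parser, and other versions may treat unit-less values differently.

```python
import re

# Milliseconds per unit; this table is an assumption covering common suffixes.
_UNITS_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def interval_to_millis(value):
    """Interpret a refresh_interval value as milliseconds (toy parser)."""
    text = str(value).strip()
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)?", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    number, unit = match.groups()
    # No unit given: treated as milliseconds, which is the trap described above.
    return int(number) * _UNITS_MS[unit or "ms"]


one_second = interval_to_millis("1s")
two_minutes = interval_to_millis("2m")
bare_one = interval_to_millis(1)
```

Comparing `one_second` with `bare_one` shows why a missing unit is dangerous: the same digit turns into a refresh interval a thousand times shorter.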
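To me, the most valuable part here is the tuning approach: find the bottleneck (fsync), then work out a way to optimize around it.
Copyright notice: this is an original article by InfoQ author escray.
Original link: http://xie.infoq.cn/article/4d8e8f3ab63f734b7c6578403
This article is licensed under CC-BY 4.0. When reposting, please retain the original source and this copyright notice.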