Elasticsearch Near Real-Time Search

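Notes on Elasticsearch near real-time search (refresh). The notes are based on the core-knowledge part of the Elasticsearch expert course series by 中华石杉 on Bilibili; the English passages are quoted from Elasticsearch: The Definitive Guide [2.x].
Optimizing the write flow to achieve NRT (near real-time) search
The problem with the existing flow is that a segment can only be opened for search after fsync has flushed it to disk. As a result, the time from writing a document to being able to search it may exceed 1 minute, which is not near real-time search. The main bottleneck is fsync: the disk I/O needed to write the data to disk is slow.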
 
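Original flow
- A document is written to the in-memory buffer
- At regular intervals, the data in the buffer is written to a new index segment
- The index segment data is first written to the OS cache
- The index segment in the OS cache is fsynced to disk, and only then is the index segment opened
This is the point where the flow can be improved. ES does not wait for fsync to flush the data from the OS cache to disk before opening the index segment for search. Instead, as soon as the index segment data reaches the OS cache, the segment is opened for search.
By default the buffer is refreshed into a new index segment every second, so a new index segment file is produced every second.
The improved write flow is as follows:
- Data is written to the buffer
- At regular intervals, the data in the buffer is written to a segment file, but first only into the OS cache
- As soon as the segment is in the OS cache, it is opened for search; a commit is not performed immediately
The process of writing data to the OS cache and opening it for search is called a refresh. By default a refresh happens once per second. In other words, every second the data in the buffer is written to a new index segment file, which goes to the OS cache first. This is why ES is near real-time: by default, a document becomes searchable about 1 second after it is written.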
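The improved flow can be illustrated with a toy model. This is only a sketch and not how Lucene or Elasticsearch is implemented: the class ToyShard and its methods are hypothetical names, and Python lists stand in for segment files. It shows that a document becomes searchable at refresh time, before any fsync-backed commit has happened.

```python
class ToyShard:
    """Toy model of the buffer -> segment (OS cache) -> commit flow.

    Illustration only: real segments are files, not Python lists.
    """

    def __init__(self):
        self.buffer = []      # in-memory indexing buffer, not searchable
        self.segments = []    # opened segments, searchable
        self.committed = 0    # number of segments "fsynced" to disk

    def index(self, doc):
        # A new document only lands in the buffer.
        self.buffer.append(doc)

    def refresh(self):
        # Write the buffer to a new segment in the "OS cache" and open it.
        # No fsync happens here, which is why this step is cheap.
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def commit(self):
        # The expensive step: make every segment durable on disk.
        self.refresh()
        self.committed = len(self.segments)

    def search(self, term):
        # Only documents in opened segments are visible to search.
        return [d for seg in self.segments for d in seg if term in d]


shard = ToyShard()
shard.index("near real-time search")
before = shard.search("search")   # still in the buffer, so not visible
shard.refresh()
after = shard.search("search")    # visible, although nothing is committed
print(before, after, shard.committed)
```

In this model the search result changes after refresh while the committed counter stays at zero, which mirrors the separation of refresh and commit described above.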
A refresh can also be triggered manually.
In general there is no need to do this by hand; it is fine to let ES handle it on its own.
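A manual refresh is a single POST request to the refresh API. The sketch below only builds the requests with Python's standard library and does not send them. The node address http://localhost:9200 and the index name blogs are assumptions for illustration.

```python
from urllib.request import Request

ES_URL = "http://localhost:9200"   # assumed address of a local node


def refresh_request(index=None):
    """Build a POST request for the refresh API (the request is not sent)."""
    path = f"/{index}/_refresh" if index else "/_refresh"
    return Request(ES_URL + path, method="POST")


req_all = refresh_request()          # refresh all indices
req_one = refresh_request("blogs")   # refresh only the placeholder index "blogs"
print(req_one.get_method(), req_one.full_url)
```

Against a running cluster, passing either request to urllib.request.urlopen would trigger the refresh.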
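For example, if the freshness requirement is low and it is acceptable for a document to become searchable one minute after it is written to ES, the refresh_interval can be increased accordingly.
So the question is: where does the commit happen? That is covered in the next installment.
Near Real-Time Search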
Committing a new segment to disk requires an fsync to ensure that the segment is physically written to disk and that data will not be lost if there is a power failure. But an fsync is costly, ...
 
 A Lucene index with new documents in the in-memory buffer
Documents in the in-memory indexing buffer are written to a new segment. The new segment is written to the filesystem cache first - which is cheap - and only later is it flushed to disk - which is expensive. But once a file is in the cache, it can be opened and read, just like any other file.
Lucene allows new segments to be written and opened - making the documents they contain visible to search - without performing a full commit.
 
 The buffer contents have been written to a segment, which is searchable, but is not yet committed
refresh API
In Elasticsearch, this lightweight process of writing and opening a new segment is called a refresh. By default, every shard is refreshed automatically once every second. This is why we say that Elasticsearch has near real-time search: document changes are not visible to search immediately, but will become visible within 1 second.
While a refresh is much lighter than a commit, it still has a performance cost.
Not all use cases require a refresh every second... You can reduce the frequency of refreshes on a per-index basis by setting the refresh_interval:
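As a rough illustration, the setting can be supplied in the settings section of the request that creates the index. The sketch below builds such a request with the standard library without sending it. The node address, the index name my_logs, and the value 30s are assumptions chosen for the example.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"   # assumed address of a local node


def create_index_request(index, refresh_interval):
    """Build a PUT request that creates an index with a custom refresh_interval."""
    body = {"settings": {"refresh_interval": refresh_interval}}
    return Request(
        f"{ES_URL}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# "my_logs" is a placeholder index name; refresh it every 30 seconds.
req = create_index_request("my_logs", "30s")
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```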
The refresh_interval can be updated dynamically on an existing index.
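One possible use of the dynamic update is to switch automatic refreshes off while loading a large amount of data and to switch them back on afterwards. The sketch below builds the two update-settings requests without sending them. It assumes that a value of -1 disables automatic refresh, and the node address and the index name my_logs are placeholders.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"   # assumed address of a local node


def refresh_interval_request(index, value):
    """Build a PUT request that changes refresh_interval on an existing index."""
    body = {"index": {"refresh_interval": value}}
    return Request(
        f"{ES_URL}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumed workflow: disable refresh before a bulk load, restore it afterwards.
disable = refresh_interval_request("my_logs", -1)
restore = refresh_interval_request("my_logs", "1s")
print(disable.data.decode("utf-8"))
print(restore.data.decode("utf-8"))
```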
The refresh_interval expects a duration such as 1s (1 second) or 2m (2 minutes). An absolute number like 1 means 1 millisecond - a sure way to bring your cluster to its knees.
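To guard against the bare-number mistake, a client could validate the value before sending it. The helper below is a hypothetical client-side check, not part of Elasticsearch. It accepts only a small set of units chosen for this sketch, and it lets -1 through on the assumption that -1 disables automatic refresh.

```python
import re

# Units accepted by this sketch only; this is not the full Elasticsearch list.
_DURATION = re.compile(r"^\d+(ms|s|m|h|d)$")


def check_refresh_interval(value):
    """Return value if it is -1 or a duration with an explicit unit, else raise."""
    if value == -1 or value == "-1":
        return value                      # assumed to mean "refresh disabled"
    if isinstance(value, str) and _DURATION.match(value):
        return value                      # for example "1s" or "2m"
    raise ValueError(f"refresh_interval {value!r} has no time unit")


print(check_refresh_interval("1s"))
print(check_refresh_interval("2m"))
```

With this check in place, both the integer 1 and the string "1" raise a ValueError instead of reaching the cluster.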
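I think the most valuable part here is the tuning approach: identify the bottleneck (fsync), then find a way to optimize around it.
Copyright notice: this is an original article by InfoQ author escray.
Original link: http://xie.infoq.cn/article/4d8e8f3ab63f734b7c6578403
This article is licensed under CC-BY 4.0. When reposting, please keep the original source and this copyright notice.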












 
    