
Elasticsearch Near Real-Time Search

escray
Published: March 13, 2021

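Notes on Elasticsearch near real-time search (refresh). The Chinese-language material comes from the core-knowledge part of the Bilibili course series by 中华石杉 (Elasticsearch 顶尖高手系列课程); the English excerpts come from Elasticsearch: The Definitive Guide [2.x].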

Optimizing the write flow to achieve near real-time (NRT) search


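The problem with the existing flow: a segment must always wait for fsync to flush it to disk before it can be opened for search. As a result, the delay between writing a document and being able to search for it can exceed one minute, which is not near real-time search. The main bottleneck is the disk I/O performed by fsync, which is slow.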



The original flow


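  1. A document is written to the in-memory buffer.

  2. At regular intervals, the data in the buffer is written to a new index segment.

  3. The data in the index segment is first written to the OS cache.

  4. The index segment in the OS cache is fsynced to disk, and only then is the index segment opened.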


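This is where the improvement comes in: Elasticsearch does not wait for fsync to flush the data from the OS cache to disk before opening the index segment for search. Instead, as soon as the index segment data reaches the OS cache, the segment is opened and made available for search.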


Every second, the buffer is refreshed into a new index segment, so a new index segment file is produced every second.


The write flow is improved as follows


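  1. Data is written to the buffer.

  2. At regular intervals, the data in the buffer is written to a segment file, which lands in the OS cache first.

  3. As soon as the segment is in the OS cache, it is opened for search; no commit is performed at this point.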


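The process of writing data to the OS cache and opening it for search is called a refresh. By default, a refresh happens once every second: every second, the data in the buffer is written to a new index segment file, which lands in the OS cache first. This is why Elasticsearch is near real-time: by default, a document becomes searchable about 1 second after it is written.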
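The refresh flow described above can be sketched as a toy model. This is only an illustration of the idea and not how Elasticsearch or Lucene is implemented; the class `ToyShard` and its methods are hypothetical names made up for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ToyShard:
    """Toy model of one shard: an in-memory buffer plus opened segments."""

    buffer: list = field(default_factory=list)    # written, not yet searchable
    segments: list = field(default_factory=list)  # opened segments (searchable)

    def index(self, doc: str) -> None:
        # A new document only lands in the in-memory buffer.
        self.buffer.append(doc)

    def refresh(self) -> None:
        # Move the buffer contents into a new segment and open it for search.
        # In the real system the segment sits in the OS cache at this point;
        # no fsync (commit) is modeled here.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer.clear()

    def search(self, term: str) -> list:
        # Search only sees opened segments, never the buffer.
        return [doc for seg in self.segments for doc in seg if term in doc]


shard = ToyShard()
shard.index("near real-time search")
before = shard.search("search")  # the buffer is not searchable yet
shard.refresh()
after = shard.search("search")   # visible after the refresh
print(before, after)
```

The point of the sketch is the gap between `index` and `refresh`: a document exists but is invisible to search until the next refresh, which is why the default 1-second interval makes search "near" real-time rather than real-time.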


You can also trigger a refresh manually:


POST /my_index/_refresh


In general there is no need to refresh manually; just let Elasticsearch handle it.
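For the rare case where a script does need a manual refresh (for example in a test that indexes a document and searches for it right away), the request can be built with the Python standard library. This is a minimal sketch: the helper name `build_refresh_request` and the address `http://localhost:9200` are assumptions, and the request is only constructed here, not sent.

```python
import urllib.request


def build_refresh_request(base_url: str, index: str = "") -> urllib.request.Request:
    """Build (but do not send) a POST request for the refresh API.

    With an index name the target is /<index>/_refresh; without one it is
    /_refresh, which refreshes all indices.
    """
    path = f"/{index}/_refresh" if index else "/_refresh"
    return urllib.request.Request(base_url.rstrip("/") + path, method="POST")


req = build_refresh_request("http://localhost:9200", "my_index")
print(req.get_method(), req.full_url)

# To send it against a running cluster (assumed to listen on localhost:9200):
#     with urllib.request.urlopen(req) as resp:
#         print(resp.read().decode())
```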


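For example, if freshness requirements are loose and it is acceptable for a document to become searchable only within a minute of being written, you can increase the refresh interval: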


PUT /my_index
{
  "settings": {
    "refresh_interval": "30s"
  }
}


So the question is: where does the commit happen? That is the topic of the next installment.

Near Real-Time Search


Committing a new segment to disk requires an fsync to ensure that the segment is physically written to disk and that data will not be lost if there is a power failure. But an fsync is costly, ...



Figure: A Lucene index with new documents in the in-memory buffer


Documents in the in-memory indexing buffer are written to a new segment. The new segment is written to the filesystem cache first - which is cheap - and only later is it flushed to disk - which is expensive. But once a file is in the cache, it can be opened and read, just like any other file.


Lucene allows new segments to be written and opened - making the documents they contain visible to search - without performing a full commit.



Figure: The buffer contents have been written to a segment, which is searchable, but is not yet committed

refresh API


In Elasticsearch, this lightweight process of writing and opening a new segment is called a refresh. By default, every shard is refreshed automatically once every second. This is why we say that Elasticsearch has near real-time search: document changes are not visible to search immediately, but will become visible within 1 second.


// Refresh all indices
POST /_refresh

// Refresh just the blogs index
POST /blogs/_refresh


While a refresh is much lighter than a commit, it still has a performance cost.


Not all use cases require a refresh every second... You can reduce the frequency of refreshes on a per-index basis by setting the refresh_interval:


// Refresh the my_logs index every 30 seconds
PUT /my_logs
{
  "settings": {
    "refresh_interval": "30s"
  }
}


The refresh_interval can be updated dynamically on an existing index.


// Disable automatic refreshes
PUT /my_logs/_settings
{ "refresh_interval": -1 }

// Refresh automatically every second
PUT /my_logs/_settings
{ "refresh_interval": "1s" }
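The two settings updates above could be issued from a script roughly as follows. This is a sketch under assumptions: the helper name `build_refresh_interval_request` and the address `http://localhost:9200` are made up for illustration, the body mirrors the flat form used in the snippets above, and the requests are only built, not sent.

```python
import json
import urllib.request


def build_refresh_interval_request(base_url, index, interval):
    """Build (but do not send) a PUT /<index>/_settings request.

    `interval` is passed through unchanged, e.g. -1 or "1s".
    """
    body = json.dumps({"refresh_interval": interval}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Disable automatic refreshes, then later restore the 1-second default.
disable = build_refresh_interval_request("http://localhost:9200", "my_logs", -1)
restore = build_refresh_interval_request("http://localhost:9200", "my_logs", "1s")
print(disable.get_method(), disable.full_url, disable.data.decode())
print(restore.get_method(), restore.full_url, restore.data.decode())
```

Passing the value as a string with a unit ("1s") rather than a bare number avoids the millisecond pitfall described in the text.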


The refresh_interval expects a duration such as 1s (1 second) or 2m (2 minutes). An absolute number like 1 means 1 millisecond - a sure way to bring your cluster to its knees.
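To make the unit pitfall concrete, here is a small sketch that interprets a `refresh_interval` value the way the paragraph above describes it. It is a simplified illustration and not Elasticsearch's actual parser: the function name `refresh_interval_to_ms` and the set of supported units are assumptions, and newer Elasticsearch versions may reject unit-less values instead of treating them as milliseconds.

```python
import re

# Milliseconds per unit; only a few common units are covered in this sketch.
_UNITS_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000}


def refresh_interval_to_ms(value):
    """Return the interval in milliseconds, or None when refresh is disabled (-1)."""
    text = str(value).strip()
    if text == "-1":
        return None
    match = re.fullmatch(r"(\d+)(ms|s|m|h)?", text)
    if match is None:
        raise ValueError(f"unsupported refresh_interval: {value!r}")
    number, unit = match.groups()
    # A bare number is read as milliseconds, as the text above describes.
    return int(number) * _UNITS_MS[unit or "ms"]


for value in ("1s", "2m", 1, -1):
    print(repr(value), "->", refresh_interval_to_ms(value))
```

Under this reading, `"1s"` is 1000 ms while a bare `1` is a single millisecond, a thousand times more frequent.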


To me, the most valuable takeaway here is the tuning approach: identify the bottleneck (fsync), then find a way to optimize around it.

