Elasticsearch Write Path: Making Changes Persistent
Elasticsearch write path, Making Changes Persistent (translog, flush). The Chinese-language notes come from the core-knowledge module of Zhonghua Shishan's Elasticsearch master series of courses on Bilibili; the English excerpts come from Elasticsearch: The Definitive Guide [2.x].
With the write path described so far, a server crash would lose everything in the OS cache, the index segments, and the in-memory buffer, so reliability was poor.
Walking through the diagram
A document is written into the in-memory buffer and, at the same time, appended to the translog file (persisted to disk?)
Every second, data moves from the buffer into a new index segment file; the segment file enters the OS cache, and the segment is opened for search
Search requests can hit the index segment files sitting in the OS cache
Once a segment is written to the OS cache, the in-memory buffer is cleared
The steps above repeat, and over time the translog file keeps growing; once it is large enough, a flush (that is, a commit) is triggered
All data currently in the buffer is written to a single segment file, pushed into the OS cache, and opened for search
A commit point is written to disk, recording which index segments exist
fsync forces the data down to the physical disk; everything in the OS cache is flushed (persisted)
The existing translog file is emptied
The refined write path
Data is written to the in-memory buffer and to the translog file
Every second, the data in the buffer is written to a new segment file and enters the OS cache; the segment is opened and made available for search
The buffer is cleared
Steps 1-3 repeat: new segments keep being added, the buffer keeps being emptied, and data keeps accumulating in the translog
When the translog reaches a certain size, a commit takes place
All data in the buffer is written to a new segment, pushed into the OS cache, and opened for use
The buffer is cleared
A commit point is written to disk, listing all the index segments
All the index segment file data in the filesystem cache is forced to disk with fsync
The existing translog is emptied and a new translog is created
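The cycle above can be sketched as a toy model. This is only an illustration of the concepts in these notes; the `Shard` class and its methods are invented for the sketch and do not reflect the real Elasticsearch internals:

```python
# Toy model of the write path: buffer -> (refresh) segment in OS cache ->
# (flush) fsync'ed segments on disk + commit point + fresh translog.

class Shard:
    def __init__(self):
        self.buffer = []        # in-memory indexing buffer
        self.translog = []      # append-only operation log
        self.os_cache = []      # searchable segments, not yet fsync'ed
        self.disk = []          # segments durably on disk
        self.commit_point = []  # names of segments known to be on disk

    def index(self, doc):
        # every write goes to the buffer AND the translog
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # ~every second: buffer becomes a new segment in the OS cache,
        # which is opened for search; the buffer is cleared
        if self.buffer:
            self.os_cache.append(list(self.buffer))
            self.buffer.clear()

    def search(self):
        # search sees committed and cached segments, but not the raw buffer
        return [d for seg in self.disk + self.os_cache for d in seg]

    def flush(self):
        # commit: refresh any remaining buffer, fsync cached segments to
        # disk, write a commit point, and empty the translog
        self.refresh()
        self.disk.extend(self.os_cache)
        self.os_cache.clear()
        self.commit_point = [f"segment_{i}" for i in range(len(self.disk))]
        self.translog.clear()
```

Note how a freshly indexed document is invisible to `search()` until `refresh()` runs, which is exactly the near-real-time behaviour described above.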
How recovery works, based on the translog and the commit point
Some data has piled up in the OS cache when, inconveniently, the machine crashes; everything in the OS cache is lost. How do we recover the data?
The translog stores every data-change record from the last flush (commit point) up to now
The OS disk holds everything up to the last commit point: every segment file up to that point was fsync'ed to disk
After the machine restarts, the data on disk is still intact. The change records in the translog are replayed: the earlier operations are re-executed into the buffer, refreshed into segments in the OS cache one by one, and then simply wait for the next commit
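The replay idea can be sketched in a few lines. The `(op, doc_id)` tuple format and the function name are made up purely for illustration:

```python
# Toy recovery: segments fsync'ed before the last commit point survive a
# crash; everything after it is rebuilt by replaying the translog in order.

def replay_translog(committed_ids, translog):
    """Re-execute each logged operation on top of the last committed state."""
    docs = set(committed_ids)
    for op, doc_id in translog:
        if op == "index":
            docs.add(doc_id)      # re-index the document
        elif op == "delete":
            docs.discard(doc_id)  # re-apply the deletion
    return docs
```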
fsync plus clearing the translog is exactly what flush is; by default a flush runs every 30 minutes, or whenever the translog grows too large
In general, don't flush by hand; letting it run automatically is fine
The translog itself is fsync'ed to disk every 5 seconds. After an index, update, or delete operation, the operation only succeeds once the fsync has completed on both the primary shard and the replica shards
Forcing an fsync of the translog on every single operation can make some operations slow; if losing a small amount of data is acceptable, you can configure the translog to be fsync'ed asynchronously instead
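The asynchronous behaviour is controlled by the index-level translog settings. A sketch of the request (`my_index` is just a placeholder index name):

```
PUT /my_index/_settings
{
    "index.translog.durability": "async",
    "index.translog.sync_interval": "5s"
}
```

With `async` durability, the translog is fsync'ed only once per `sync_interval`, so up to that many seconds of acknowledged writes can be lost if the machine crashes.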
Making Changes Persistent
https://www.elastic.co/guide/en/elasticsearch/guide/2.x/translog.html
a full commit point, which lists all known segments. Elasticsearch uses this commit point during startup or when reopening an index to decide which segments belong to the current shard.
When a document is indexed, it is added to the in-memory buffer and appended to the translog
New documents are added to the in-memory buffer and appended to the translog
Once every second, the shard is refreshed. The refresh leaves the shard in the state...
The docs in the in-memory buffer are written to a new segment, without an fsync
The segment is opened to make it visible to search
The in-memory buffer is cleared.
After a refresh, the buffer is cleared but the transaction log is not
This process continues with more documents being added to the in-memory buffer and appended to the transaction log
The transaction log keeps accumulating documents
Every so often - such as when the translog is getting too big - the index is flushed; a new translog is created, and a full commit is performed.
Any docs in the in-memory buffer are written to a new segment.
The buffer is cleared
A commit point is written to disk
The filesystem cache is flushed with an fsync
The old translog is deleted.
After a flush, the segments are fully committed and the transaction log is cleared.
The translog provides a persistent record of all operations that have not yet been flushed to disk.
The translog is also used to provide real-time CRUD. When you try to retrieve, update, or delete a document by ID, it first checks the translog for any recent changes before trying to retrieve the document from the relevant segment. This means that it always has access to the latest known version of the document, in real-time.
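That lookup order can be sketched as a toy model. The tuple format and the `get_doc` helper are invented for illustration, not the real API:

```python
# Toy real-time GET: consult the translog for the newest change to a
# document before falling back to the (possibly stale) segments.

def get_doc(doc_id, translog, segments):
    # the newest operations are at the end of the translog, so scan in reverse
    for op, _id, source in reversed(translog):
        if _id == doc_id:
            return None if op == "delete" else source
    # no recent change logged: use the last refreshed/committed copy
    return segments.get(doc_id)
```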
flush API
the action of performing a commit and truncating the translog
Shards are flushed automatically every 30 minutes, or when the translog becomes too big.
You seldom need to issue a manual flush yourself; usually, automatic flushing is all that is required.
it is beneficial to flush your indices before restarting a node or closing an index.
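The manual flush calls look like this in the 2.x guide's console style (`blogs` is just an example index name):

```
POST /blogs/_flush

POST /_flush?wait_for_ongoing
```

The first flushes a single index; the second flushes all indices and waits until any in-flight flushes have completed before returning.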
How Safe Is the Translog?
The purpose of the translog is to ensure that operations are not lost.
Writes to a file will not survive a reboot until the file has been fsync'ed to disk. By default, the translog is fsync'ed every 5 seconds and after a write request completes... your client won't receive a 200 OK response until the entire request has been fsync'ed in the translog of the primary and all replicas.
for some high-volume clusters where losing a few seconds of data is not critical, it can be advantageous to fsync asynchronously.
This setting can be configured per-index and is dynamically updatable.
Copyright notice: this is an original article by InfoQ author escray.
Original link: http://xie.infoq.cn/article/848736815574b467babbbce00
This article is licensed under CC-BY 4.0; please keep the original link and this copyright notice when republishing.