Elasticsearch Inverted Index
Elasticsearch Inverted Index,内容来自 B 站中华石杉 Elasticsearch 顶尖高手系列课程核心知识篇,英文内容来自 Elasticsearch: The Definitive Guide [2.x]
Inverted Index
https://www.elastic.co/guide/en/elasticsearch/guide/2.x/inverted-index.html
two documents, each with a content field containing the following:
The quick brown fox jumped over the lazy dog
Quick brown foxes leap over lazy dogs in summer
first, split the content field of each document into separate words (which we call term, or tokens), create a sorted list of all the unique terms, and then list in which document each term appears. The result looks something like this:
if we want to search for quick brown, we just need to find the documents in which each term appears:
Both documents match, but the first document has more matches than the second.
after normalize the terms into a standard format (lowercased, stemmed, synonyms...)
You can find only terms that exist in your index, so both only indexed text and the query string must be normalized into the same form.
This process of tokenization and normalization is called analysis.
Making Text Searchable
https://www.elastic.co/guide/en/elasticsearch/guide/2.x/making-text-searchable.html
倒排索引是适合用于进行搜索的
The data structure that best supports the multiple-values-per-field requirement is the inverted index.
The inverted index contains a sorted list of all the unique values, or terms, that occur in any document and, for each term, a list of all the documents that contain it.
historically, an inverted index was used to index whole unstructured text documents. A document in Elasticsearch is a structured JSON document with fields and values. In reality, every indexed field in a JSON document has its own inverted index
倒排索引的结构
包含这个关键词的 document list
包含这个关键词的所有 document 的数量:IDF(inverse document frequency)
这个关键词在每个 document 中出现的次数:TF(term frequency)
这个关键词在这个 document 中的次序
每个 document 的长度:length norm
包含这个关键词的所有 document 的平均长度
the inverted index needs to know about all documents in the collection in order for it to function as intended.
Immutablity
The inverted index that is written to disk is immutable: it doesn't change. Ever.
There is no need for locking.
Once the index has been read into the kernel's filesystem cache, it stays there, because it never changes. As long as there is enough space in the filesystem cache, most reads will come from memory instead of having to hit disk.
Any other caches (like the filter cache) remain valid for the life of the index.
Writing a single large inverted index allows the data to be compressed, reducing costly disk I/O and the amount of RAM needed to cache the index.
倒排索引不可变的好处
不需要锁,提升并发能力,避免锁的问题
数据不变,一直保存在 os cache 中,只要 cache 内存足够
filter cache 一直驻留在内存,因为数据不变
可以压缩,节省 cpu 和 io 开销
倒排索引不可变的坏处:每次都要重新构建整个索引
downsides: You can't change it. If you want to make the new documents searchable, you have to rebuild the entired index.
版权声明: 本文为 InfoQ 作者【escray】的原创文章。
原文链接:【http://xie.infoq.cn/article/5501ae1652c9adac088c10646】。
本文遵守【CC-BY 4.0】协议,转载请保留原文出处及本版权声明。
评论