写点什么

Elasticsearch Inverted Index

用户头像
escray
关注
发布于: 2021 年 03 月 11 日
Elasticsearch Inverted Index

Elasticsearch Inverted Index,内容来自 B 站中华石杉 Elasticsearch 顶尖高手系列课程核心知识篇,英文内容来自 Elasticsearch: The Definitive Guide [2.x]

Inverted Index

https://www.elastic.co/guide/en/elasticsearch/guide/2.x/inverted-index.html


two documents, each with a content field containing the following:


  1. The quick brown fox jumped over the lazy dog

  2. Quick brown foxes leap over lazy dogs in summer


first, split the content field of each document into separate words (which we call term, or tokens), create a sorted list of all the unique terms, and then list in which document each term appears. The result looks something like this:


Term      Doc_1  Doc_2-------------------------Quick   |       |  XThe     |   X   |brown   |   X   |  Xdog     |   X   |dogs    |       |  Xfox     |   X   |foxes   |       |  Xin      |       |  Xjumped  |   X   |lazy    |   X   |  Xleap    |       |  Xover    |   X   |  Xquick   |   X   |summer  |       |  Xthe     |   X   |------------------------
复制代码


if we want to search for quick brown, we just need to find the documents in which each term appears:


Term      Doc_1  Doc_2-------------------------brown   |   X   |  Xquick   |   X   |------------------------Total   |   2   |  1
复制代码


Both documents match, but the first document has more matches than the second.


after normalize the terms into a standard format (lowercased, stemmed, synonyms...)


Term      Doc_1  Doc_2-------------------------brown   |   X   |  Xdog     |   X   |  Xfox     |   X   |  Xin      |       |  Xjump    |   X   |  Xlazy    |   X   |  Xover    |   X   |  Xquick   |   X   |  Xsummer  |       |  Xthe     |   X   |  X------------------------
复制代码


You can find only terms that exist in your index, so both only indexed text and the query string must be normalized into the same form.


This process of tokenization and normalization is called analysis.

Making Text Searchable

https://www.elastic.co/guide/en/elasticsearch/guide/2.x/making-text-searchable.html


倒排索引是适合用于进行搜索的


The data structure that best supports the multiple-values-per-field requirement is the inverted index.


The inverted index contains a sorted list of all the unique values, or terms, that occur in any document and, for each term, a list of all the documents that contain it.


Term  | Doc 1 | Doc 2 | Doc 3 | ...------------------------------------brown |   X   |       |  X    | ...fox   |   X   |   X   |  X    | ...quick |   X   |   X   |       | ...the   |   X   |       |  X    | ...
复制代码


historically, an inverted index was used to index whole unstructured text documents. A document in Elasticsearch is a structured JSON document with fields and values. In reality, every indexed field in a JSON document has its own inverted index


倒排索引的结构


  1. 包含这个关键词的 document list

  2. 包含这个关键词的所有 document 的数量:IDF(inverse document frequency)

  3. 这个关键词在每个 document 中出现的次数:TF(term frequency)

  4. 这个关键词在这个 document 中的次序

  5. 每个 document 的长度:length norm

  6. 包含这个关键词的所有 document 的平均长度


word		doc1		doc2dog		   *		   *hello		 *you				       *
复制代码


the inverted index needs to know about all documents in the collection in order for it to function as intended.


Immutablity


The inverted index that is written to disk is immutable: it doesn't change. Ever.


  1. There is no need for locking.

  2. Once the index has been read into the kernel's filesystem cache, it stays there, because it never changes. As long as there is enough space in the filesystem cache, most reads will come from memory instead of having to hit disk.

  3. Any other caches (like the filter cache) remain valid for the life of the index.

  4. Writing a single large inverted index allows the data to be compressed, reducing costly disk I/O and the amount of RAM needed to cache the index.


倒排索引不可变的好处


  1. 不需要锁,提升并发能力,避免锁的问题

  2. 数据不变,一直保存在 os cache 中,只要 cache 内存足够

  3. filter cache 一直驻留在内存,因为数据不变

  4. 可以压缩,节省 cpu 和 io 开销


倒排索引不可变的坏处:每次都要重新构建整个索引


downsides: You can't change it. If you want to make the new documents searchable, you have to rebuild the entired index.


发布于: 2021 年 03 月 11 日阅读数: 14
用户头像

escray

关注

Let's Go 2017.11.19 加入

在学 Elasticsearch 的项目经理

评论

发布
暂无评论
Elasticsearch Inverted Index