
Elasticsearch Doc Values and doc_values

escray
Published: February 27, 2021

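Notes on Elasticsearch Doc Values and doc_values. The material comes from the "Core Knowledge" part of the Elasticsearch Top Expert course series by 中华石杉 (Zhonghua Shishan) on Bilibili; the English passages come from Elasticsearch: The Definitive Guide [2.x] and the current official documentation.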

Doc Values Intro


  • When searching, we need to be able to map a term to a list of documents.

  • When sorting, we need to map a document to its terms. In other words, we need to "uninvert" the inverted index.


This "uninverted" structure is often called a "column-store" in other systems. Essentially, it stores all the values for a single field together in a single column of data, which makes it very efficient for operations like sorting.


In Elasticsearch, this column-store is known as doc values, and is enabled by default. Doc values are created at index-time: when a field is indexed, Elasticsearch adds the tokens to the inverted index for search. But it also extracts the terms and adds them to the columnar doc values.
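As a rough illustration of the two structures being built in the same indexing pass, here is a minimal Python sketch. The function name `index_field` and the sample documents are hypothetical, and the tokenization is deliberately naive; it is a toy model, not how Lucene actually encodes either structure. Note also that current Elasticsearch versions do not keep doc values for text fields, so this mirrors the description above rather than today's behavior for analyzed text.

```python
from collections import defaultdict


def index_field(docs):
    """Build a toy inverted index and a toy doc-values column for one field.

    docs: list of strings; a document's position in the list is its ordinal.
    """
    inverted = defaultdict(list)  # term -> list of doc ordinals (search side)
    doc_values = []               # doc ordinal -> sorted terms (sort/aggregate side)
    for ordinal, text in enumerate(docs):
        # Naive analysis: lowercase, drop commas, split on whitespace.
        terms = sorted(set(text.lower().replace(",", " ").split()))
        for term in terms:
            inverted[term].append(ordinal)
        doc_values.append(terms)
    return dict(inverted), doc_values


inverted, doc_values = index_field(["hello world", "hi world"])
print(inverted["world"])  # term -> documents: [0, 1]
print(doc_values[1])      # document -> terms: ['hi', 'world']
```

The inverted index answers "which documents contain this term?", while the column answers "which terms does this document have?" without scanning every posting list.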


Doc values are used in several places in Elasticsearch:


  • Sorting on a field

  • Aggregation on a field

  • Certain filters (for example, geolocation filters)

  • Scripts that refer to fields
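To see why a column layout suits aggregations, consider this small Python sketch. The column contents and the function name `terms_aggregation` are made up for illustration; the point is only that counting values is a single pass over one contiguous column.

```python
from collections import Counter

# Hypothetical doc-values column for a "status_code" field.
# Position i holds the value for document ordinal i.
status_code_column = ["200", "404", "200", "500", "200"]


def terms_aggregation(column, size=10):
    """Count distinct values in a column, most frequent first."""
    return Counter(column).most_common(size)


buckets = terms_aggregation(status_code_column)
print(buckets[0])  # ('200', 3)
```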


Because doc values are serialized to disk, we can leverage the OS to help keep access fast. When the "working set" is smaller than the available memory on a node, the OS will naturally keep all the doc values hot in memory, leading to very fast access. When the "working set" is much larger than available memory, the OS will naturally start to page doc values on/off disk without running into the dreaded OutOfMemory exception.
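The idea of letting the OS manage caching can be sketched with a memory-mapped file. This is only an analogy under simplifying assumptions: the fixed-width 8-byte layout, the file name `age.col`, and the helper names are all hypothetical, and Lucene's real on-disk format is compressed and considerably more involved.

```python
import mmap
import os
import struct
import tempfile


def write_column(path, values):
    """Serialize a column of 64-bit signed ints, one fixed-width slot per doc."""
    with open(path, "wb") as f:
        for value in values:
            f.write(struct.pack("<q", value))


def read_value(path, ordinal):
    """Read one document's value through a memory map.

    The process never loads the whole file onto its own heap; the OS page
    cache decides which pages stay in memory and which are read from disk.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return struct.unpack_from("<q", mm, ordinal * 8)[0]


path = os.path.join(tempfile.mkdtemp(), "age.col")
write_column(path, [27, 30])
print(read_value(path, 1))  # 30
```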


... sorting (and some other operations) happen on a parallel data structure which is built at index-time.


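Searching relies on the inverted index. Sorting relies on a forward index: Elasticsearch has to look at the field values of each document and then sort on them. This forward index is simply doc values.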


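At index time, Elasticsearch builds the inverted index for searching and also builds the forward index, that is, doc values, for sorting, aggregations, filtering, and similar operations.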


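Doc values are stored on disk. If there is enough memory, the OS automatically caches them in memory, so performance stays high; if there is not enough memory, the OS pages them to and from disk as needed.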


doc1: hello world you and me

doc2: hi, world, how are you



word     doc1    doc2
hello     *
world     *       *
you       *       *
and       *
me        *
hi                *
how               *
are               *


hello you --> hello, you


hello --> doc1

you --> doc1,doc2


doc1: hello world you and me

doc2: hi, world, how are you
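The lookup traced above can be sketched in Python. The dictionary reproduces the postings for doc1 and doc2; the function names `search` and `match_any` are hypothetical, and treating the query as "match any term" is an assumption made for the sketch.

```python
# Inverted index for the two example documents: term -> documents containing it.
inverted_index = {
    "hello": {"doc1"},
    "world": {"doc1", "doc2"},
    "you": {"doc1", "doc2"},
    "and": {"doc1"},
    "me": {"doc1"},
    "hi": {"doc2"},
    "how": {"doc2"},
    "are": {"doc2"},
}


def search(query):
    """Split the query into terms and look up each term's documents."""
    terms = query.lower().split()
    return {term: sorted(inverted_index.get(term, set())) for term in terms}


def match_any(query):
    """Return every document that contains at least one query term."""
    hits = set()
    for docs in search(query).values():
        hits.update(docs)
    return sorted(hits)


print(search("hello you"))     # {'hello': ['doc1'], 'you': ['doc1', 'doc2']}
print(match_any("hello you"))  # ['doc1', 'doc2']
```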


sort by age


doc1: { "name": "jack", "age": 27 }

doc2: { "name": "tom", "age": 30 }



document    name    age
doc1        jack    27
doc2        tom     30
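Sorting by age then becomes a per-document lookup in the age column. The following Python sketch assumes a toy layout in which position i of each column holds the value for document i; the names `columns` and `sort_by` are illustrative only.

```python
# Toy doc-values columns: position i holds the value for document ordinal i.
doc_ids = ["doc1", "doc2"]
columns = {
    "name": ["jack", "tom"],
    "age": [27, 30],
}


def sort_by(field, hits, descending=False):
    """Order matching doc ordinals by looking each one up in the field's column."""
    column = columns[field]
    return sorted(hits, key=lambda ordinal: column[ordinal], reverse=descending)


hits = [0, 1]  # ordinals of the documents that matched the query
ordered = sort_by("age", hits, descending=True)
print([doc_ids[i] for i in ordered])  # ['doc2', 'doc1']
```

No posting list is consulted during the sort: each hit costs one array access into the column.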

doc_values


Doc values store the same values as the _source but in a column-oriented fashion that is way more efficient for sorting and aggregations. Doc values are supported on almost all field types, with the notable exception of text and annotated_text fields.


All fields which support doc values have them enabled by default. If you are sure that you don't need to sort or aggregate on a field, or access the field value from a script, you can disable doc values in order to save disk space.


# status_code has doc_values enabled by default.
# session_id has doc_values disabled, but can still be queried.
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "status_code": {
        "type": "keyword"
      },
      "session_id": {
        "type": "keyword",
        "doc_values": false
      }
    }
  }
}
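To reason about which fields in such a mapping can still be sorted or aggregated on, a small Python check like the following may help. It encodes only the two rules stated above (text and annotated_text have no doc values; everything else defaults to enabled unless `doc_values` is false). The helper `has_doc_values` is hypothetical and ignores any other field-type exceptions.

```python
# Field types that, per the rules above, do not support doc values.
NO_DOC_VALUES_TYPES = {"text", "annotated_text"}


def has_doc_values(field_mapping):
    """Return True if the field would have doc values under the simplified rules."""
    if field_mapping.get("type") in NO_DOC_VALUES_TYPES:
        return False
    # Enabled by default unless the mapping explicitly turns it off.
    return field_mapping.get("doc_values", True)


mapping = {
    "mappings": {
        "properties": {
            "status_code": {"type": "keyword"},
            "session_id": {"type": "keyword", "doc_values": False},
        }
    }
}

properties = mapping["mappings"]["properties"]
sortable = [name for name, field in properties.items() if has_doc_values(field)]
print(sortable)  # ['status_code']
```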