Using scroll cursors in Elasticsearch

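Published: July 22, 2020
I ran into a rather painful problem: data had been lost from MySQL.
  There were two ways to recover it. One was to ask the DBA to dig out the data from six months ago. The other was to look in the ES index that also stored the MySQL data.
  Since there were more than 10,000 records and ES was configured to return at most 10,000 results per query, fetching the data with scroll seemed the better choice.

(1) ElasticSearch 2.x
(1.1) Check how many documents the index holds
   curl -XGET 'http://127.0.0.1:9400/dev_index1_20190118/_count?pretty'

   Per-node search statistics, including open scroll contexts:
   localhost:9200/_nodes/stats/indices/search?pretty 

(1.2) View the index data
   curl -XGET 'http://127.0.0.1:9400/dev_index1_20190118/docs/_search?pretty' 

(1.3) Open a scroll cursor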
curl -XGET 'http://127.0.0.1:9400/dev_index1_20190118/docs/_search?scroll=10m' -d ' 
{ 
    "query": { "match_all": {}},
    "sort" : ["_doc"], 
    "size":  10000
}'  >> es_scroll_data_20190118_1w.txt

(1.4) Keep fetching the next page
curl -XGET 'http://127.0.0.1:9400/_search/scroll' -d ' 
{ 
    "scroll": "10m",
    "scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAANKLTFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADSi1BY3X1Z6N2NoRlNQaTlGLTFueDk1d0xBAAAAAAA0otYWN19WejdjaEZTUGk5Ri0xbng5NXdMQQAAAAAANKLVFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADvJpxZzcU9YSExnLVRTNk5RY3JfMlNuWU9n"
}' >> es_scroll_data_20190118_2w.txt
  
curl -XGET 'http://127.0.0.1:9400/_search/scroll' -d ' 
{ 
    "scroll": "10m",
    "scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAANKLTFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADSi1BY3X1Z6N2NoRlNQaTlGLTFueDk1d0xBAAAAAAA0otYWN19WejdjaEZTUGk5Ri0xbng5NXdMQQAAAAAANKLVFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADvJpxZzcU9YSExnLVRTNk5RY3JfMlNuWU9n"
}' >> es_scroll_data_20190118_3w.txt
  
(2) ElasticSearch 5.6.x
(2.1) View index information
   localhost:9200/_nodes/stats/indices/search?pretty 

   curl -XGET 'http://127.0.0.1:9400/dev_index1_20190118/docs/_search?pretty' 

(2.2) Open a scroll cursor
curl -XGET 'http://127.0.0.1:9400/dev_index1_20190118/docs/_search?scroll=10m' -d ' 
{ 
    "query": { "match_all": {}},
    "sort" : ["_doc"], 
    "size":  10000
}'  >> es_scroll_data_20190118_1w.txt

(2.3) Keep fetching the next page
curl -XGET 'http://127.0.0.1:9400/_search/scroll?scroll=10m' -d ' 
{ 
    "scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAANKLTFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADSi1BY3X1Z6N2NoRlNQaTlGLTFueDk1d0xBAAAAAAA0otYWN19WejdjaEZTUGk5Ri0xbng5NXdMQQAAAAAANKLVFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADvJpxZzcU9YSExnLVRTNk5RY3JfMlNuWU9n"
}' >> es_scroll_data_20190118_2w.txt
﻿
curl -XGET 'http://127.0.0.1:9400/_search/scroll?scroll=10m' -d ' 
{ 
    "scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAANKLTFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADSi1BY3X1Z6N2NoRlNQaTlGLTFueDk1d0xBAAAAAAA0otYWN19WejdjaEZTUGk5Ri0xbng5NXdMQQAAAAAANKLVFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADvJpxZzcU9YSExnLVRTNk5RY3JfMlNuWU9n"
}' >> es_scroll_data_20190118_3w.txt
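
The manual steps above can be automated with a small loop. The sketch below is one possible way to do it, not the method used to produce the files above: it opens a cursor, then keeps sending the most recent _scroll_id to the /_search/scroll endpoint until a page comes back with no hits. ES_URL, INDEX, KEEP_ALIVE, OUT and the function names (es_request, extract_scroll_id, is_last_page, scroll_all) are assumptions made for illustration, and the response is inspected with plain text matching rather than a JSON parser, so treat it as a starting point only.

```shell
# Assumed defaults; override them from the environment as needed.
ES_URL="${ES_URL:-http://127.0.0.1:9400}"
INDEX="${INDEX:-dev_index1_20190118}"
KEEP_ALIVE="${KEEP_ALIVE:-10m}"
OUT="${OUT:-es_scroll_data.txt}"

# Hypothetical wrapper: send JSON body $2 to path $1 and print the response.
es_request() {
    curl -s -XGET "${ES_URL}$1" -d "$2"
}

# Read a response on stdin and print its _scroll_id (naive text match).
extract_scroll_id() {
    sed -n 's/.*"_scroll_id" *: *"\([^"]*\)".*/\1/p' | head -n 1
}

# Read a response on stdin; succeed when its inner hits array is empty.
is_last_page() {
    tr -d ' \n\t' | grep -q '"hits":\[\]'
}

scroll_all() {
    local resp scroll_id
    # The first request opens the cursor on the index.
    resp=$(es_request "/${INDEX}/docs/_search?scroll=${KEEP_ALIVE}" \
        '{"query":{"match_all":{}},"sort":["_doc"],"size":10000}')
    while true; do
        scroll_id=$(printf '%s' "$resp" | extract_scroll_id)
        # Stop on an error response (no _scroll_id) or on an empty page.
        [ -n "$scroll_id" ] || break
        if printf '%s' "$resp" | is_last_page; then break; fi
        printf '%s\n' "$resp" >> "$OUT"
        # Follow-up requests always carry the latest _scroll_id.
        resp=$(es_request "/_search/scroll" \
            "{\"scroll\":\"${KEEP_ALIVE}\",\"scroll_id\":\"${scroll_id}\"}")
    done
}
```

Calling scroll_all appends each page's response to the file named by OUT. Because every page is requested as soon as the previous one has been written, the gap between two requests stays well inside the keep-alive.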
﻿
(3) Problems encountered
(3.1) Unknown key for a VALUE_STRING in [scroll_id].
{
    "error": {
        "root_cause": [
            {
                "type": "parsing_exception",
                "reason": "Unknown key for a VALUE_STRING in [scroll_id].",
                "line": 3,
                "col": 19
            }
        ],
        "type": "parsing_exception",
        "reason": "Unknown key for a VALUE_STRING in [scroll_id].",
        "line": 3,
        "col": 19
    },
    "status": 400
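}

  Cause: the scroll_id used in the second request did not match the scroll_id returned by the first. This parsing error is also what the plain _search endpoint returns when scroll_id appears in the request body, so check that follow-up requests go to /_search/scroll and carry the most recent _scroll_id.

(3.2) Unknown key for a VALUE_STRING in [scroll]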
{
    "error": {
        "root_cause": [
            {
                "type": "parsing_exception",
                "reason": "Unknown key for a VALUE_STRING in [scroll].",
                "line": 3,
                "col": 15
            }
        ],
        "type": "parsing_exception",
        "reason": "Unknown key for a VALUE_STRING in [scroll].",
        "line": 3,
        "col": 15
    },
    "status": 400
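}

  Cause: the second request carried an extra scroll parameter in its body. The _search endpoint only accepts scroll in the query string (?scroll=10m). A scroll key in the body is accepted by /_search/scroll.

(3.3) Batch size is too large, size must be less than or equal to: [10000] but was [1000000]. Scroll batch sizes cost as much memory as result windows so they are controlled by the [index.max_result_window] index level setting.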
{
    "error": {
        "root_cause": [
            {
                "type": "query_phase_execution_exception",
                "reason": "Batch size is too large, size must be less than or equal to: [10000] but was [1000000]. Scroll batch sizes cost as much memory as result windows so they are controlled by the [index.max_result_window] index level setting."
            }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
            {
                "shard": 0,
                "index": "dev_index1_20190118",
                "node": "8XqKY198S823M78QA43F8g",
                "reason": {
                    "type": "query_phase_execution_exception",
                    "reason": "Batch size is too large, size must be less than or equal to: [10000] but was [1000000]. Scroll batch sizes cost as much memory as result windows so they are controlled by the [index.max_result_window] index level setting."
                }
            }
        ]
    },
    "status": 500
}

  The requested size was too large. It exceeded 10000, which is the default value of the index-level setting index.max_result_window.
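
Rather than raising size, the export can be planned as several batches within the limit. As a small worked example, assuming the document count is already known (for instance from hits.total in a search response), the number of batches is the total divided by size, rounded up. batches_needed is a hypothetical helper name, and 25000 is an illustrative figure, not the real size of the index.

```shell
# Number of scroll batches needed to read TOTAL documents at SIZE per batch
# (integer division, rounded up).
batches_needed() {
    total=$1
    size=$2
    echo $(( (total + size - 1) / size ))
}

batches_needed 25000 10000   # prints 3
```

One further request then returns an empty page, which signals the end of the scroll.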
﻿
(3.4) search_context_missing_exception
{
    "error": {
        "root_cause": [
            {
                "type": "search_context_missing_exception",
                "reason": "No search context found for id [3540965]"
            },
            {
                "type": "search_context_missing_exception",
                "reason": "No search context found for id [3922089]"
            },
            {
                "type": "search_context_missing_exception",
                "reason": "No search context found for id [3454995]"
            },
            {
                "type": "search_context_missing_exception",
                "reason": "No search context found for id [3454996]"
            },
            {
                "type": "search_context_missing_exception",
                "reason": "No search context found for id [3454994]"
            }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
            {
                "shard": -1,
                "index": null,
                "reason": {
                    "type": "search_context_missing_exception",
                    "reason": "No search context found for id [3540965]"
                }
            },
            {
                "shard": -1,
                "index": null,
                "reason": {
                    "type": "search_context_missing_exception",
                    "reason": "No search context found for id [3922089]"
                }
            },
            {
                "shard": -1,
                "index": null,
                "reason": {
                    "type": "search_context_missing_exception",
                    "reason": "No search context found for id [3454995]"
                }
            },
            {
                "shard": -1,
                "index": null,
                "reason": {
                    "type": "search_context_missing_exception",
                    "reason": "No search context found for id [3454996]"
                }
            },
            {
                "shard": -1,
                "index": null,
                "reason": {
                    "type": "search_context_missing_exception",
                    "reason": "No search context found for id [3454994]"
                }
            }
        ],
        "caused_by": {
            "type": "search_context_missing_exception",
            "reason": "No search context found for id [3454994]"
        }
    },
    "status": 404
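}

  The scroll had in fact timed out, and ES had removed its search context automatically. The keep-alive (scroll=10m here) has to cover the time between two consecutive scroll requests.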
  
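(4) How scroll works
 An ES search runs in two phases: Query and Fetch. In the Query phase, each shard looks up the inverted index and returns the IDs of its top matching documents, and the coordinating node merges and sorts them globally. In the Fetch phase, the documents for the final page are retrieved from the shards. When paging with from + size, the number of results that must be sorted globally and then discarded grows with from, so performance keeps getting worse.

 A scroll query avoids that repeated global sort after the initial Query phase. It keeps the query's result set in a search context, and each later batch only has to read the next size documents, walking each shard in a fixed order (index order when sorting by _doc).

 This also shows that scroll is not suited to real-time, user-facing pagination. Its main use is pulling a large result set out of an ES cluster in batches, typically in offline scenarios: exporting a very large result set to another system for processing, or reindexing a large index.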
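
To put rough numbers on the from + size cost, here is a small worked sketch. It assumes the simple model in which every shard returns its top (from + size) candidates and the coordinating node merges all of them. The shard count of 5 and the helper name merged_entries are illustrative assumptions, not values taken from the cluster above.

```shell
# Entries the coordinating node has to merge for one from+size page,
# under the simple model shards * (from + size).
merged_entries() {
    shards=$1
    from=$2
    size=$3
    echo $(( shards * (from + size) ))
}

merged_entries 5 0 10        # first page: prints 50
merged_entries 5 10000 10    # page starting at offset 10000: prints 50050
```

In both calls only 10 documents are returned to the client, so almost all of the merged entries in the second case are sorted and then thrown away.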
﻿
Original link: http://xie.infoq.cn/article/07a9caa56088b41721b692756. Please contact the author before reposting.
wkq2786130
