Elasticsearch Scroll (Cursor) Usage

Published: July 22, 2020

I ran into a rather painful problem: some MySQL data had been lost.

There were two options: ask the DBA to dig up data from half a year ago, or pull it from the Elasticsearch cluster that keeps a copy of the MySQL data.

Since there were well over 10,000 rows, and ES caps a single query at 10,000 results, fetching the data with scroll looked like the better approach.



(1) Elasticsearch 2.x



(1.1) Check how much data the index holds

curl -XGET 'http://localhost:9200/_nodes/stats/indices/search?pretty'

(Note: this endpoint returns node-level search statistics; the exact document total of a single index is easier to read from hits.total in the search response below.)



(1.2) View index information



curl -XGET 'http://127.0.0.1:9400/dev_index1_20190118/docs/_search?pretty'



(1.3) Start a scroll



curl -XGET 'http://127.0.0.1:9400/dev_index1_20190118/docs/_search?scroll=10m' -d '
{
"query": { "match_all": {}},
"sort" : ["_doc"],
"size": 10000
}' >> es_scroll_data_20190118_1w.txt
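The response to this first request carries a _scroll_id field, which every follow-up request must send back. A minimal sketch of pulling it (and the hits) out of a response body; the sample JSON below is fabricated and trimmed to the fields that matter:

```python
import json

def parse_scroll_page(body):
    """Extract the scroll id and the hits list from one _search response body."""
    doc = json.loads(body)
    return doc["_scroll_id"], doc["hits"]["hits"]

# Fabricated sample response, trimmed to the fields we use.
sample = '{"_scroll_id": "DnF1...abc", "hits": {"total": 2, "hits": [{"_id": "1"}, {"_id": "2"}]}}'
scroll_id, hits = parse_scroll_page(sample)
print(scroll_id, len(hits))  # DnF1...abc 2
```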



<!--more-->



(1.4) Fetch the next page repeatedly



In 2.x, follow-up requests go to the _search/scroll endpoint, with the keep-alive as a query parameter and the scroll id as the raw request body:

curl -XGET 'http://127.0.0.1:9400/_search/scroll?scroll=10m' -d 'DnF1ZXJ5VGhlbkZldGNoBQAAAAAANKLTFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADSi1BY3X1Z6N2NoRlNQaTlGLTFueDk1d0xBAAAAAAA0otYWN19WejdjaEZTUGk5Ri0xbng5NXdMQQAAAAAANKLVFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADvJpxZzcU9YSExnLVRTNk5RY3JfMlNuWU9n' >> es_scroll_data_20190118_2w.txt

curl -XGET 'http://127.0.0.1:9400/_search/scroll?scroll=10m' -d 'DnF1ZXJ5VGhlbkZldGNoBQAAAAAANKLTFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADSi1BY3X1Z6N2NoRlNQaTlGLTFueDk1d0xBAAAAAAA0otYWN19WejdjaEZTUGk5Ri0xbng5NXdMQQAAAAAANKLVFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADvJpxZzcU9YSExnLVRTNk5RY3JfMlNuWU9n' >> es_scroll_data_20190118_3w.txt

(2) Elasticsearch 5.6.x



(2.1) View index information



localhost:9200/_nodes/stats/indices/search?pretty



curl -XGET 'http://127.0.0.1:9400/dev_index1_20190118/docs/_search?pretty'



(2.2) Start a scroll



curl -XGET 'http://127.0.0.1:9400/dev_index1_20190118/docs/_search?scroll=10m' -d '
{
"query": { "match_all": {}},
"sort" : ["_doc"],
"size": 10000
}' >> es_scroll_data_20190118_1w.txt



(2.3) Fetch the next page repeatedly



From 5.x on, follow-up requests go to the _search/scroll endpoint with a JSON body carrying both the keep-alive and the scroll id:

curl -XGET 'http://127.0.0.1:9400/_search/scroll' -d '
{
"scroll" : "10m",
"scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAANKLTFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADSi1BY3X1Z6N2NoRlNQaTlGLTFueDk1d0xBAAAAAAA0otYWN19WejdjaEZTUGk5Ri0xbng5NXdMQQAAAAAANKLVFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADvJpxZzcU9YSExnLVRTNk5RY3JfMlNuWU9n"
}' >> es_scroll_data_20190118_2w.txt

curl -XGET 'http://127.0.0.1:9400/_search/scroll' -d '
{
"scroll" : "10m",
"scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAANKLTFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADSi1BY3X1Z6N2NoRlNQaTlGLTFueDk1d0xBAAAAAAA0otYWN19WejdjaEZTUGk5Ri0xbng5NXdMQQAAAAAANKLVFjdfVno3Y2hGU1BpOUYtMW54OTV3TEEAAAAAADvJpxZzcU9YSExnLVRTNk5RY3JfMlNuWU9n"
}' >> es_scroll_data_20190118_3w.txt
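The requests above can be wrapped in a loop that keeps fetching until a page comes back empty. A sketch that separates the scroll protocol from the HTTP transport; the fetch callable stands in for curl or an HTTP client, and everything beyond the endpoint paths and body fields shown in this post is an assumption:

```python
def scroll_all(fetch, index, scroll="10m", size=10000):
    """Yield every hit from `index` by driving the scroll protocol.

    `fetch(path, body)` performs one HTTP request and returns the parsed
    JSON response; in real use it would wrap curl or an HTTP library.
    """
    first = {"query": {"match_all": {}}, "sort": ["_doc"], "size": size}
    resp = fetch("/%s/_search?scroll=%s" % (index, scroll), first)
    while resp["hits"]["hits"]:
        for hit in resp["hits"]["hits"]:
            yield hit
        # Each follow-up request renews the keep-alive and passes the
        # scroll id returned by the previous response.
        resp = fetch("/_search/scroll",
                     {"scroll": scroll, "scroll_id": resp["_scroll_id"]})
```

Because the transport is injected, the loop can be exercised against canned pages before pointing it at a real cluster.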



(3) Problems encountered



(3.1) Unknown key for a VALUE_STRING in [scroll_id].



{
"error": {
"root_cause": [
{
"type": "parsing_exception",
"reason": "Unknown key for a VALUE_STRING in [scroll_id].",
"line": 3,
"col": 19
}
],
"type": "parsing_exception",
"reason": "Unknown key for a VALUE_STRING in [scroll_id].",
"line": 3,
"col": 19
},
"status": 400
}



This parsing_exception means the request body contained a key the target endpoint does not recognize: the follow-up request was still sent to /_search, which has no scroll_id field, instead of /_search/scroll. (A stale or wrong scroll id would instead surface as the search_context_missing_exception of 3.4.)



(3.2) Unknown key for a VALUE_STRING in [scroll]



{
"error": {
"root_cause": [
{
"type": "parsing_exception",
"reason": "Unknown key for a VALUE_STRING in [scroll].",
"line": 3,
"col": 15
}
],
"type": "parsing_exception",
"reason": "Unknown key for a VALUE_STRING in [scroll].",
"line": 3,
"col": 15
},
"status": 400
}



Same class of error as 3.1: the follow-up request carried a scroll key in a body sent to an endpoint that does not accept it. Pass the keep-alive either as the ?scroll=10m query parameter or inside the JSON body of /_search/scroll, but not in the body of a plain /_search request.



(3.3) Batch size is too large, size must be less than or equal to: [10000] but was [1000000]. Scroll batch sizes cost as much memory as result windows so they are controlled by the [index.max_result_window] index level setting.



{
"error": {
"root_cause": [
{
"type": "query_phase_execution_exception",
"reason": "Batch size is too large, size must be less than or equal to: [10000] but was [1000000]. Scroll batch sizes cost as much memory as result windows so they are controlled by the [index.max_result_window] index level setting."
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "dev_index1_20190118",
"node": "8XqKY198S823M78QA43F8g",
"reason": {
"type": "query_phase_execution_exception",
"reason": "Batch size is too large, size must be less than or equal to: [10000] but was [1000000]. Scroll batch sizes cost as much memory as result windows so they are controlled by the [index.max_result_window] index level setting."
}
}
]
},
"status": 500
}



The requested size was too large, exceeding 10,000: index.max_result_window defaults to 10000, and scroll batch sizes are checked against it just like ordinary from + size result windows.
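If batches larger than 10,000 are genuinely needed, the per-index limit can be raised (at a memory cost) with a settings update. A sketch of building that request body; the index name comes from this post, and 50000 is an arbitrary example value:

```python
import json

# index.max_result_window caps both from+size windows and scroll batch sizes.
new_limit = 50000  # arbitrary example; larger windows cost more heap
body = json.dumps({"index": {"max_result_window": new_limit}})
print(body)

# The equivalent request would be:
#   curl -XPUT 'http://127.0.0.1:9400/dev_index1_20190118/_settings' -d "$body"
```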



(3.4) search_context_missing_exception



{
"error": {
"root_cause": [
{
"type": "search_context_missing_exception",
"reason": "No search context found for id [3540965]"
},
{
"type": "search_context_missing_exception",
"reason": "No search context found for id [3922089]"
},
{
"type": "search_context_missing_exception",
"reason": "No search context found for id [3454995]"
},
{
"type": "search_context_missing_exception",
"reason": "No search context found for id [3454996]"
},
{
"type": "search_context_missing_exception",
"reason": "No search context found for id [3454994]"
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": -1,
"index": null,
"reason": {
"type": "search_context_missing_exception",
"reason": "No search context found for id [3540965]"
}
},
{
"shard": -1,
"index": null,
"reason": {
"type": "search_context_missing_exception",
"reason": "No search context found for id [3922089]"
}
},
{
"shard": -1,
"index": null,
"reason": {
"type": "search_context_missing_exception",
"reason": "No search context found for id [3454995]"
}
},
{
"shard": -1,
"index": null,
"reason": {
"type": "search_context_missing_exception",
"reason": "No search context found for id [3454996]"
}
},
{
"shard": -1,
"index": null,
"reason": {
"type": "search_context_missing_exception",
"reason": "No search context found for id [3454994]"
}
}
],
"caused_by": {
"type": "search_context_missing_exception",
"reason": "No search context found for id [3454994]"
}
},
"status": 404
}



The scroll had actually timed out: once the keep-alive expires, the server deletes the search context automatically.
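Two practical consequences: renew the keep-alive by passing the scroll parameter on every follow-up request, and free contexts explicitly once finished instead of holding memory until the timeout. A minimal sketch of building the clear-scroll request body (the scroll id is a placeholder):

```python
import json

def clear_scroll_body(scroll_ids):
    """Build the body for DELETE /_search/scroll, which frees the
    server-side search contexts before their keep-alive expires."""
    return json.dumps({"scroll_id": list(scroll_ids)})

body = clear_scroll_body(["DnF1...abc"])
print(body)

# The equivalent request would be:
#   curl -XDELETE 'http://127.0.0.1:9400/_search/scroll' -d "$body"
```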

(4) How scroll works



An ES search runs in two phases: Query and Fetch. The Query phase is relatively lightweight; it walks the inverted index and collects the IDs of the documents matching the query. The Fetch phase is heavier: the results from each shard have to be retrieved and globally sorted on the coordinating node. When paging with from + size, the larger from grows, the more results must be globally sorted and then discarded, so performance keeps degrading.

A scroll query, after the lightweight Query phase, skips the expensive global sort entirely. It simply keeps the result set, i.e. the list of doc IDs, in a search context; on each subsequent batch, every shard just returns the next size documents in a fixed internal order (by default, doc id order).

This also shows why scroll is not suited to real-time, user-facing pagination. Its main use is pulling a large result set out of an ES cluster in batches, typically offline: for example, exporting a very large result set into another system for processing, or reindexing a large index.
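For such offline bulk exports, 5.x also offers sliced scroll: the result set is split into independent slices that can be scrolled in parallel by separate workers. A sketch of building the initial request body for each slice; the slice count and size below are arbitrary example values:

```python
import json

def sliced_scroll_bodies(num_slices, size=10000):
    """One initial scroll request body per slice; each slice is scrolled
    independently, typically from its own worker or process."""
    return [
        json.dumps({
            "slice": {"id": i, "max": num_slices},
            "query": {"match_all": {}},
            "sort": ["_doc"],
            "size": size,
        })
        for i in range(num_slices)
    ]

for body in sliced_scroll_bodies(2):
    print(body)
```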





