
Elasticsearch search scroll: cursor queries

escray
Published: 2021-03-03

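Notes on Elasticsearch search scroll (cursor queries). The material comes from the Core Knowledge part of 中华石杉's Elasticsearch expert course series on Bilibili, and the English passages are quoted from Elasticsearch: The Definitive Guide [2.x]. Some of it looks dated, but I think the underlying principles should be much the same. Criticism is welcome.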

Scroll


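In Chinese, Elasticsearch search scroll is sometimes translated as "滚动查询" (scrolling query), which is perhaps more intuitive, but I find "游标查询" (cursor query) more professional. After all, the way a scroll executes is a little like a cursor in a relational database.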


A scroll query is used to retrieve large numbers of documents from Elasticsearch efficiently, without paying the penalty of deep pagination.
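
To make the deep-pagination penalty a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the usual from/size behavior, where every shard builds its own sorted top (from + size) entries and the coordinating node merges all of them before returning one page. The function name deep_page_cost and the 5-shard index are made up for illustration.

```python
def deep_page_cost(page_from: int, size: int, shards: int) -> dict:
    """Estimate how many sorted entries a from/size request has to handle."""
    per_shard = page_from + size      # each shard builds its own top (from + size)
    coordinator = per_shard * shards  # the coordinating node merges all of them
    return {"per_shard": per_shard, "coordinator": coordinator, "returned": size}


if __name__ == "__main__":
    # First page versus a deep page on a hypothetical 5-shard index.
    print(deep_page_cost(page_from=0, size=10, shards=5))
    print(deep_page_cost(page_from=10_000, size=10, shards=5))
```

Under these assumptions the deep page still returns only 10 documents, but the coordinating node has to sort tens of thousands of entries to find them. That is the cost a scroll avoids.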


Scrolling allows us to do an initial search and to keep pulling batches of results from Elasticsearch until there are no more results left. It's a bit like a cursor in a traditional database.


A scrolled search takes a snapshot in time. It doesn't see any changes that are made to the index after the initial search request has been made. It does this by keeping the old data files around, so that it can preserve its "view" on what the index looked like at the time it started.
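
The snapshot behavior can be illustrated with a toy in-memory model. This is only a sketch of the visible semantics, not of how Elasticsearch keeps old data files around; ToyIndex and ToyScroll are hypothetical names invented for this example.

```python
class ToyIndex:
    """In-memory stand-in for an index; NOT how Elasticsearch stores data."""

    def __init__(self, docs):
        self.docs = list(docs)

    def open_scroll(self, size):
        # Freeze the current contents: later writes do not touch this tuple.
        return ToyScroll(tuple(self.docs), size)


class ToyScroll:
    """Cursor over the frozen snapshot taken when the scroll was opened."""

    def __init__(self, snapshot, size):
        self._snapshot = snapshot
        self._size = size
        self._pos = 0

    def next_batch(self):
        batch = list(self._snapshot[self._pos:self._pos + self._size])
        self._pos += len(batch)
        return batch  # an empty list means the scroll is exhausted


def demo():
    index = ToyIndex(["a", "b", "c"])
    scroll = index.open_scroll(size=2)
    first = scroll.next_batch()
    index.docs.append("d")   # written after the scroll was opened
    index.docs.remove("c")   # deleted after the scroll was opened
    rest = scroll.next_batch()
    return first, rest, scroll.next_batch()


if __name__ == "__main__":
    print(demo())
```

In this model the scroll still returns "c" and never sees "d", because both changes happened after the snapshot was taken.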


To scroll through results, we execute a search request and set the scroll value to the length of time we want to keep the scroll window open.


// Keep the scroll window open for 1 minute.
GET /old_index/_search?scroll=1m
{
  "query": { "match_all": {}},
  // _doc is the most efficient sort order
  "sort": ["_doc"],
  "size": 1000
}
{ "_scroll_id" : "DnF1ZXJ5VGhlbkZldGNoGAAAAAAAAwLVFkJrZ2tqb19LU2xHenRuWFY5bTV4YWcAAAAAAAGBRBZ6bGpxSzNONFJXdW05dEFQOWNqVlZnAAAAAAADAtYWQmtna2pvX0tTbEd6dG5YVjltNXhhZwAAAAAAAqgDFk1vaDJ5eWJCUmIyWGY5SnFJdDlqR1EAAAAAAAMC1xZCa2dram9fS1NsR3p0blhWOW01eGFnAAAAAAACqAQWTW9oMnl5YkJSYjJYZjlKcUl0OWpHUQAAAAAAAwLYFkJrZ2tqb19LU2xHenRuWFY5bTV4YWcAAAAAAAMC2RZCa2dram9fS1NsR3p0blhWOW01eGFnAAAAAAADAtoWQmtna2pvX0tTbEd6dG5YVjltNXhhZwAAAAAAAwLbFkJrZ2tqb19LU2xHenRuWFY5bTV4YWcAAAAAAAMC3BZCa2dram9fS1NsR3p0blhWOW01eGFnAAAAAAADAt4WQmtna2pvX0tTbEd6dG5YVjltNXhhZwAAAAAAAYFFFnpsanFLM040Uld1bTl0QVA5Y2pWVmcAAAAAAAMC3RZCa2dram9fS1NsR3p0blhWOW01eGFnAAAAAAACqAUWTW9oMnl5YkJSYjJYZjlKcUl0OWpHUQAAAAAAAwLfFkJrZ2tqb19LU2xHenRuWFY5bTV4YWcAAAAAAAMC4BZCa2dram9fS1NsR3p0blhWOW01eGFnAAAAAAADAuEWQmtna2pvX0tTbEd6dG5YVjltNXhhZwAAAAAAAqgGFk1vaDJ5eWJCUmIyWGY5SnFJdDlqR1EAAAAAAAMC4hZCa2dram9fS1NsR3p0blhWOW01eGFnAAAAAAADAuMWQmtna2pvX0tTbEd6dG5YVjltNXhhZwAAAAAAAqgHFk1vaDJ5eWJCUmIyWGY5SnFJdDlqR1EAAAAAAAMC5BZCa2dram9fS1NsR3p0blhWOW01eGFnAAAAAAABgUYWemxqcUszTjRSV3VtOXRBUDljalZWZw==", "took" : 220, "timed_out" : false, "_shards" : { "total" : 24, "successful" : 24, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 292441, "relation" : "eq" },


Perhaps because of a version difference, the _scroll_id I got back is clearly longer than the one shown in the documentation.


The response to this request includes a _scroll_id, which is a long Base-64 encoded string. Now we can pass the _scroll_id to the _search/scroll endpoint to retrieve the next batch of results:


GET /_search/scroll
{
  // set the scroll expiration to 1m again
  "scroll": "1m",
  "scroll_id": "cXVlcnlUaGVuRmV0Y2g7NTsxMDk5NDpkUmpiR2FjOFNhNnlCM1ZDMWpWYnRROzEwOTk1OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MTA5OTM6ZFJqYkdhYzhTYTZ5QjNWQzFqVmJ0UTsxMTE5MDpBVUtwN2lxc1FLZV8yRGVjWlI2QUVBOzEwOTk2OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MDs="
}
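
The two requests above fit together as a loop: one initial search, then repeated calls to _search/scroll until a batch comes back empty. Below is a minimal Python sketch of that loop. The transport is deliberately abstracted as a hypothetical send(method, path, body) callable (an assumption, not a real client API), and make_fake_send is an in-memory stand-in so the loop can run without a cluster.

```python
from typing import Any, Callable, Dict, Iterator

# Hypothetical transport: (method, path, json_body) -> parsed JSON response.
Send = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def scroll_hits(send: Send, index: str, size: int = 1000,
                keep_alive: str = "1m") -> Iterator[Dict[str, Any]]:
    """Yield every hit of a match_all scroll, one batch at a time."""
    body = {"query": {"match_all": {}}, "sort": ["_doc"], "size": size}
    response = send("GET", f"/{index}/_search?scroll={keep_alive}", body)
    while True:
        hits = response["hits"]["hits"]
        if not hits:  # an empty batch means there is nothing left to pull
            return
        yield from hits
        # Always pass along the id from the most recent response.
        response = send("GET", "/_search/scroll",
                        {"scroll": keep_alive,
                         "scroll_id": response["_scroll_id"]})


def make_fake_send(docs, size):
    """In-memory stand-in for a cluster (illustration only)."""
    state = {"pos": 0}

    def send(method, path, body):
        batch = docs[state["pos"]:state["pos"] + size]
        state["pos"] += len(batch)
        return {"_scroll_id": "fake-id", "hits": {"hits": batch}}

    return send


if __name__ == "__main__":
    docs = [{"_id": str(i)} for i in range(7)]
    for hit in scroll_hits(make_fake_send(docs, 3), "old_index", size=3):
        print(hit["_id"])
```

Writing the loop as a generator lets the caller process documents as they arrive instead of holding the whole result set in memory, which is the point of scrolling in the first place.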


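The response to this request includes the next batch of results. Although we specified a size of 1,000, we get back many more documents. When scanning, the size is applied to each shard, so you will get back a maximum of size * number_of_primary_shards documents in each batch. (This per-shard rule belongs to the old scan search type, which was removed in Elasticsearch 5.0. In later versions, size caps the total number of hits in each batch.)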
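
The batch-size arithmetic can be worked through with a small Python sketch. It covers both the per-shard rule quoted above and the whole-batch rule that applies once the scan search type is gone. The helper names are invented for this example, and the 5 primary shards are an assumed figure; the 292,441 total is the hit count from the response shown earlier.

```python
import math


def max_batch_docs(size: int, primary_shards: int, per_shard: bool) -> int:
    """Upper bound on the number of documents in one scroll batch."""
    return size * primary_shards if per_shard else size


def batches_needed(total_hits: int, batch_docs: int) -> int:
    """Number of non-empty batches required to drain total_hits."""
    return math.ceil(total_hits / batch_docs)


if __name__ == "__main__":
    total = 292_441
    for per_shard in (True, False):
        batch = max_batch_docs(size=1000, primary_shards=5, per_shard=per_shard)
        print(per_shard, batch, batches_needed(total, batch))
```

With these numbers a batch holds at most 5,000 documents under the per-shard rule (59 batches) and at most 1,000 otherwise (293 batches), followed in either case by one final empty response that signals the end.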


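If you need to pull, say, 100,000 documents in a single request, performance will be poor. The usual approach in that case is a scroll query: fetch the data batch by batch until everything has been retrieved and processed.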


With scroll, you search for one batch, then the next batch, and so on, until all of the data has been returned.


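On the first search, scroll saves a snapshot of the view at that moment. Every later request is served from that old snapshot only, so data changes made in the meantime are not visible to the user.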


Sorting by _doc gives the best performance.


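Every scroll request must also carry a scroll parameter that specifies a time window. Each request only needs to complete within that window, so the window has to cover one batch, not the whole retrieval.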
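
The way the time window behaves can be sketched with a toy model in Python. It assumes that each scroll call sets a new expiry time, which is why a short window is enough. ToyScrollContext is a hypothetical class, the clock is passed in as plain seconds, and parse_keep_alive only understands the s, m, h and d suffixes.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_keep_alive(value: str) -> int:
    """Convert a time value such as "1m" or "30s" to seconds."""
    unit = value[-1]
    if unit not in UNIT_SECONDS:
        raise ValueError(f"unsupported time value: {value!r}")
    return int(value[:-1]) * UNIT_SECONDS[unit]


class ToyScrollContext:
    """Toy model of a scroll window; not Elasticsearch's implementation."""

    def __init__(self, keep_alive: str, now: int):
        self.expires_at = now + parse_keep_alive(keep_alive)

    def fetch(self, keep_alive: str, now: int) -> None:
        if now > self.expires_at:
            raise RuntimeError("scroll context expired")
        # Each request sets a fresh expiry time, measured from now.
        self.expires_at = now + parse_keep_alive(keep_alive)


if __name__ == "__main__":
    ctx = ToyScrollContext("1m", now=0)
    ctx.fetch("1m", now=50)    # inside the first window, renews to t=110
    ctx.fetch("1m", now=100)   # 100 s after the start, still valid
    print(ctx.expires_at)
```

In this model the scroll stays usable 100 seconds after it was opened even though the window is only one minute, because each request arrived before the previous window ran out.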


GET /test_index/test_type/_search?scroll=1m
{
  "query": {
    "match_all": {}
  },
  "sort": [ "_doc" ],
  "size": 3
}
{
  "_scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAACxeFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAAsYBY0b25zVFlWWlRqR3ZJajlfc3BXejJ3AAAAAAAALF8WNG9uc1RZVlpUakd2SWo5X3NwV3oydwAAAAAAACxhFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAAsYhY0b25zVFlWWlRqR3ZJajlfc3BXejJ3",
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },


The response includes a _scroll_id, and the next scroll request must send it back as scroll_id.


GET /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAACxeFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAAsYBY0b25zVFlWWlRqR3ZJajlfc3BXejJ3AAAAAAAALF8WNG9uc1RZVlpUakd2SWo5X3NwV3oydwAAAAAAACxhFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAAsYhY0b25zVFlWWlRqR3ZJajlfc3BXejJ3"
}


Scroll looks a lot like pagination, but the use cases are different.


  • Pagination is mainly for searching page by page and showing the results to users;

  • Scroll is mainly for retrieving data batch by batch so that a system can process it.

