
002 ES NGram Tokenization + suggest

小林-1025
Published on: April 29, 2021

Continuing with the requirement from the previous post, this feature has to provide smart typo correction combined with search, as described below.


Requirements:

    When the user types "使用手川" (a typo), the search results must include both "使用手册" and "使用手川"

    When the user types "马央九" (a typo), the search results must include both "马英九" and "马央九"

Solution:

First, tokenize the target field with an NGram analyzer, then run a term suggest so that ES returns suggestion options based on the index.

Next, run a terms query with the original search keyword plus the options returned in the previous step, giving the widest possible coverage.
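The two steps above amount to building two request bodies. A minimal Python sketch (the helper names `build_suggest_body` and `build_terms_body` are hypothetical; the dicts mirror the DSL requests shown later in this post):

```python
def build_suggest_body(keyword: str, field: str) -> dict:
    """Step 1: ask the term suggester for candidate terms for `keyword`."""
    return {
        "suggest": {
            "my-suggestion": {
                "text": keyword,
                "term": {"field": field, "suggest_mode": "always"},
            }
        }
    }


def build_terms_body(field: str, keyword: str, options: list[str]) -> dict:
    """Step 2: exact `terms` query over the original keyword plus the suggested options."""
    return {"query": {"terms": {field: [keyword, *options]}}}
```

Each body would then be sent to `POST /blogs_ngram/_search`; only the second request returns document hits.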

Open issues:

(1) The options returned for a single short word are not ideal.

(2) As _analyze shows, ngram tokenization produces a very large number of terms, so the index takes up considerably more space.
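The term blow-up is easy to quantify: with this post's settings (min_gram=2, max_gram=20), a single run of n token characters emits one gram per (start, length) pair, i.e. the sum of (n - k + 1) for k from 2 up to min(20, n). A quick sanity check in plain Python:

```python
def ngram_count(n: int, min_gram: int = 2, max_gram: int = 20) -> int:
    """Number of n-grams emitted for a run of n token characters."""
    return sum(n - k + 1 for k in range(min_gram, min(max_gram, n) + 1))


# A 10-character field value like "中国工商银行重庆分行" already yields 45 terms.
print(ngram_count(10))  # 45
```

This matches the 45 tokens returned by the _analyze call later in this post, and it grows quadratically with field length until max_gram caps it.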

Workarounds for the open issues:

(1) The business side had no need for exact single-word search in this case, so in the end they pre-processed the keywords themselves.

(2) The data volume on the business side is small, and indexing runs as an offline job, so the extra space is acceptable.


Summary:

(1) If the goal were merely a search-box dropdown with type-ahead hints, you could log users' search terms and build the feature with the prefix-based completion suggester.

(2) Why not just use ik_max_word and let query-time analysis achieve maximum coverage? Two reasons:

a. Look at the tokenizer first: the ik tokens for "使用手册" include "使用" and "用手". With hundreds of thousands of documents, an analyzed query on such tokens returns a huge number of hits that the business does not actually want, so we need exact term queries for the returned data to be usable.

b. Moreover, the keywords the business needs to search for are not necessarily covered by ik's dictionary, and it is impractical to update the dict and rebuild the index every time.


  1. Create the index

PUT /blogs_ngram
{
  "settings": {
    "index.max_ngram_diff": 20,
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer"
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "ngram_analyzer"
      }
    }
  }
}


  2. Index the sample data

POST _bulk?refresh=true
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "中国工商银行北京分行" }
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "中国工商银行重庆分行" }
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "中国工商银行天津分行" }
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "中国招商银行北京分行" }
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "工行新疆分行" }
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "你我她公司使用手册范文" }
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "测试错别字使用手川" }
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "马英九受邀请访问日本" }
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "台湾马英九到访美国" }
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "马央九先生在吃水果" }


  3. Query: searching for the keyword "工商" returns every document that contains it. Of course, the goal here is not fuzzy matching itself; it is to get a deeper feel for how this tokenizer splits text.

POST /blogs_ngram/_search
{
  "query": {
    "term": { "name": "工商" }
  }
}

# result
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 3, "relation" : "eq" },
    "max_score" : 0.8014579,
    "hits" : [
      { "_index" : "blogs_ngram", "_type" : "_doc", "_id" : "vlmwHHkBgt5AqocFYkFa", "_score" : 0.8014579, "_source" : { "name" : "中国工商银行北京分行" } },
      { "_index" : "blogs_ngram", "_type" : "_doc", "_id" : "v1mwHHkBgt5AqocFYkFa", "_score" : 0.8014579, "_source" : { "name" : "中国工商银行重庆分行" } },
      { "_index" : "blogs_ngram", "_type" : "_doc", "_id" : "wFmwHHkBgt5AqocFYkFa", "_score" : 0.8014579, "_source" : { "name" : "中国工商银行天津分行" } }
    ]
  }
}


  4. Let's use _analyze to understand the NGram tokenizer. The output makes it obvious: starting at the first character, it slices out substrings of every configured length, then moves on to the second character, and so on through the whole string. That also explains the search results above.

POST /blogs_ngram/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "中国工商银行重庆分行"
}

# result
{
  "tokens" : [
    { "token" : "中国", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 },
    { "token" : "中国工", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 1 },
    { "token" : "中国工商", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 2 },
    { "token" : "中国工商银", "start_offset" : 0, "end_offset" : 5, "type" : "word", "position" : 3 },
    { "token" : "中国工商银行", "start_offset" : 0, "end_offset" : 6, "type" : "word", "position" : 4 },
    { "token" : "中国工商银行重", "start_offset" : 0, "end_offset" : 7, "type" : "word", "position" : 5 },
    { "token" : "中国工商银行重庆", "start_offset" : 0, "end_offset" : 8, "type" : "word", "position" : 6 },
    { "token" : "中国工商银行重庆分", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 7 },
    { "token" : "中国工商银行重庆分行", "start_offset" : 0, "end_offset" : 10, "type" : "word", "position" : 8 },
    { "token" : "国工", "start_offset" : 1, "end_offset" : 3, "type" : "word", "position" : 9 },
    { "token" : "国工商", "start_offset" : 1, "end_offset" : 4, "type" : "word", "position" : 10 },
    { "token" : "国工商银", "start_offset" : 1, "end_offset" : 5, "type" : "word", "position" : 11 },
    { "token" : "国工商银行", "start_offset" : 1, "end_offset" : 6, "type" : "word", "position" : 12 },
    { "token" : "国工商银行重", "start_offset" : 1, "end_offset" : 7, "type" : "word", "position" : 13 },
    { "token" : "国工商银行重庆", "start_offset" : 1, "end_offset" : 8, "type" : "word", "position" : 14 },
    { "token" : "国工商银行重庆分", "start_offset" : 1, "end_offset" : 9, "type" : "word", "position" : 15 },
    { "token" : "国工商银行重庆分行", "start_offset" : 1, "end_offset" : 10, "type" : "word", "position" : 16 },
    { "token" : "工商", "start_offset" : 2, "end_offset" : 4, "type" : "word", "position" : 17 },
    { "token" : "工商银", "start_offset" : 2, "end_offset" : 5, "type" : "word", "position" : 18 },
    { "token" : "工商银行", "start_offset" : 2, "end_offset" : 6, "type" : "word", "position" : 19 },
    { "token" : "工商银行重", "start_offset" : 2, "end_offset" : 7, "type" : "word", "position" : 20 },
    { "token" : "工商银行重庆", "start_offset" : 2, "end_offset" : 8, "type" : "word", "position" : 21 },
    { "token" : "工商银行重庆分", "start_offset" : 2, "end_offset" : 9, "type" : "word", "position" : 22 },
    { "token" : "工商银行重庆分行", "start_offset" : 2, "end_offset" : 10, "type" : "word", "position" : 23 },
    { "token" : "商银", "start_offset" : 3, "end_offset" : 5, "type" : "word", "position" : 24 },
    { "token" : "商银行", "start_offset" : 3, "end_offset" : 6, "type" : "word", "position" : 25 },
    { "token" : "商银行重", "start_offset" : 3, "end_offset" : 7, "type" : "word", "position" : 26 },
    { "token" : "商银行重庆", "start_offset" : 3, "end_offset" : 8, "type" : "word", "position" : 27 },
    { "token" : "商银行重庆分", "start_offset" : 3, "end_offset" : 9, "type" : "word", "position" : 28 },
    { "token" : "商银行重庆分行", "start_offset" : 3, "end_offset" : 10, "type" : "word", "position" : 29 },
    { "token" : "银行", "start_offset" : 4, "end_offset" : 6, "type" : "word", "position" : 30 },
    { "token" : "银行重", "start_offset" : 4, "end_offset" : 7, "type" : "word", "position" : 31 },
    { "token" : "银行重庆", "start_offset" : 4, "end_offset" : 8, "type" : "word", "position" : 32 },
    { "token" : "银行重庆分", "start_offset" : 4, "end_offset" : 9, "type" : "word", "position" : 33 },
    { "token" : "银行重庆分行", "start_offset" : 4, "end_offset" : 10, "type" : "word", "position" : 34 },
    { "token" : "行重", "start_offset" : 5, "end_offset" : 7, "type" : "word", "position" : 35 },
    { "token" : "行重庆", "start_offset" : 5, "end_offset" : 8, "type" : "word", "position" : 36 },
    { "token" : "行重庆分", "start_offset" : 5, "end_offset" : 9, "type" : "word", "position" : 37 },
    { "token" : "行重庆分行", "start_offset" : 5, "end_offset" : 10, "type" : "word", "position" : 38 },
    { "token" : "重庆", "start_offset" : 6, "end_offset" : 8, "type" : "word", "position" : 39 },
    { "token" : "重庆分", "start_offset" : 6, "end_offset" : 9, "type" : "word", "position" : 40 },
    { "token" : "重庆分行", "start_offset" : 6, "end_offset" : 10, "type" : "word", "position" : 41 },
    { "token" : "庆分", "start_offset" : 7, "end_offset" : 9, "type" : "word", "position" : 42 },
    { "token" : "庆分行", "start_offset" : 7, "end_offset" : 10, "type" : "word", "position" : 43 },
    { "token" : "分行", "start_offset" : 8, "end_offset" : 10, "type" : "word", "position" : 44 }
  ]
}
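The tokenizer's behavior is simple enough to mimic in a few lines of plain Python (a sketch of the slicing rule only, not the actual Lucene implementation — it ignores token_chars handling):

```python
def ngram(text: str, min_gram: int = 2, max_gram: int = 20) -> list[str]:
    """Emit every substring of length min_gram..max_gram, in start-offset order,
    matching the order of the _analyze output above."""
    return [
        text[i:i + k]
        for i in range(len(text))
        for k in range(min_gram, max_gram + 1)
        if i + k <= len(text)
    ]


tokens = ngram("中国工商银行重庆分行")
print(tokens[:3])   # ['中国', '中国工', '中国工商']
print(len(tokens))  # 45
```

The 45 substrings line up one-to-one with the 45 tokens _analyze returned, which is exactly why a term query for "工商" matched the three 工商银行 documents.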


  5. Now look at the term suggest results. For the bare words "马央九" and "手川" the suggestions are not ideal (no options are returned at all), but for "使用手川" and "台湾马央九" the suggester does return options based on the ngram terms.

POST /blogs_ngram/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "使用手册",
      "term": {
        "field": "name",
        "suggest_mode": "always"
      }
    }
  }
}

# result
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : { "total" : { "value" : 0, "relation" : "eq" }, "max_score" : null, "hits" : [ ] },
  "suggest" : {
    "my-suggestion" : [
      { "text" : "使用", "offset" : 0, "length" : 2, "options" : [ ] },
      { "text" : "使用手", "offset" : 0, "length" : 3, "options" : [ ] },
      { "text" : "使用手册", "offset" : 0, "length" : 4, "options" : [
        { "text" : "使用手册范", "score" : 0.75, "freq" : 1 },
        { "text" : "使用手川", "score" : 0.75, "freq" : 1 },
        { "text" : "使用手", "score" : 0.6666666, "freq" : 2 },
        { "text" : "使用手册范文", "score" : 0.5, "freq" : 1 }
      ] },
      { "text" : "用手", "offset" : 1, "length" : 2, "options" : [ ] },
      { "text" : "用手册", "offset" : 1, "length" : 3, "options" : [ ] },
      { "text" : "手册", "offset" : 2, "length" : 2, "options" : [ ] }
    ]
  }
}
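Collecting the option texts out of that response takes only a little glue code. A Python sketch (`collect_terms` is a hypothetical helper; `resp` below is a trimmed version of the suggest response shown above):

```python
def collect_terms(suggest_response: dict, suggestion_name: str, keyword: str) -> list[str]:
    """Merge the original keyword with every option text the term suggester
    returned, de-duplicated, preserving order."""
    seen = {keyword}
    terms = [keyword]
    for entry in suggest_response["suggest"][suggestion_name]:
        for opt in entry["options"]:
            if opt["text"] not in seen:
                seen.add(opt["text"])
                terms.append(opt["text"])
    return terms


resp = {"suggest": {"my-suggestion": [
    {"text": "使用手册", "offset": 0, "length": 4, "options": [
        {"text": "使用手册范", "score": 0.75, "freq": 1},
        {"text": "使用手川", "score": 0.75, "freq": 1},
    ]},
]}}
print(collect_terms(resp, "my-suggestion", "使用手册"))
# ['使用手册', '使用手册范', '使用手川']
```

The resulting list is exactly what feeds the terms query in the next step.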


  6. At this point we can query with the options returned above plus the original word, which gives us the widest possible set of results.

POST /blogs_ngram/_search
{
  "query" : {
    "terms": {
      "name": [
        "使用手川",
        "使用手",
        "使用手册",
        "使用手册范"
      ]
    }
  }
}

# result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 2, "relation" : "eq" },
    "max_score" : 1.0,
    "hits" : [
      { "_index" : "blogs_ngram", "_type" : "_doc", "_id" : "w1mwHHkBgt5AqocFYkFa", "_score" : 1.0, "_source" : { "name" : "你我她公司使用手册范文" } },
      { "_index" : "blogs_ngram", "_type" : "_doc", "_id" : "xFmwHHkBgt5AqocFYkFa", "_score" : 1.0, "_source" : { "name" : "测试错别字使用手川" } }
    ]
  }
}

