002 ES NGram tokenization + suggest
Continuing with the requirement from the previous post, this feature covers the following smart-correction + search needs.
Requirements:
When the user types "使用手川" (a typo), the search results must include both "使用手册" and "使用手川".
When the user types "马央九" (a typo), the search results must include both "马英九" and "马央九".
Solution:
First, tokenize the target field with an NGram tokenizer, then use the term suggester to obtain the suggestions ES derives from the index.
Next, run a query with the original search keyword plus the options returned in the previous step, to get the widest possible coverage with exact term matching.
Open issues:
(1) The options returned for a single short word are not ideal.
(2) As _analyze shows, ngram tokenization produces a very large number of terms, so the index takes up considerably more space.
Workarounds:
(1) Since the business side has no need for precise single-word search, they preprocessed the search keywords instead.
(2) The data volume is small and the workload is an offline job, so the extra index size is acceptable.
Summary:
(1) If all you need is type-ahead suggestions in a search box, you can record users' search terms and implement it with prefix completion (the completion suggester).
(2) Why not use ik_max_word and rely on analysis at query time to get the widest coverage? Two reasons:
a. From the tokenizer's point of view, the analysis of "使用手册" includes terms such as "使用" and "用手". With hundreds of thousands of documents, such terms hit far more documents than the business actually wants, so we need exact term queries for the returned data to be usable.
b. Moreover, the keywords the business needs to search are not necessarily in ik's dictionary, and it is impractical to update the dict and rebuild the index every time.
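The idea in summary point (1), recording user queries and serving prefix completions, can be sketched in a few lines of Python. This is a hypothetical in-memory stand-in for illustration only; in ES you would record queries into an index with a completion field and use the completion suggester:

```python
from collections import Counter

class PrefixSuggester:
    """Toy in-memory prefix completion over recorded user queries."""

    def __init__(self):
        self.freq = Counter()

    def record(self, query):
        # Record one user search, counting repeats.
        self.freq[query] += 1

    def suggest(self, prefix, k=5):
        # Return up to k recorded queries starting with prefix,
        # most frequent first, ties broken alphabetically.
        hits = [(q, n) for q, n in self.freq.items() if q.startswith(prefix)]
        return [q for q, _ in sorted(hits, key=lambda x: (-x[1], x[0]))[:k]]
```

The completion suggester in ES works from an FST built at index time, so it is far faster than this linear scan, but the input/output contract is the same: a prefix in, ranked completions out.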
Create the index:

PUT /blogs_ngram
{
  "settings": {
    "index.max_ngram_diff": 20,
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer"
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "ngram_analyzer"
      }
    }
  }
}

Write some documents:
POST _bulk?refresh=true
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "中国工商银行北京分行"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "中国工商银行重庆分行"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "中国工商银行天津分行"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "中国招商银行北京分行"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "工行新疆分行"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "你我她公司使用手册范文"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "测试错别字使用手川"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "马英九受邀请访问日本"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "台湾马英九到访美国"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "马央九先生在吃水果"}

Query: searching for the keyword "工商" returns every document that contains it. Our purpose here is not retrieval or fuzzy matching as such, but to understand this tokenizer's behavior one step further.
POST /blogs_ngram/_search
{
  "query": {
    "term": { "name": "工商" }
  }
}

# result
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 3, "relation" : "eq" },
    "max_score" : 0.8014579,
    "hits" : [
      { "_index" : "blogs_ngram", "_type" : "_doc", "_id" : "vlmwHHkBgt5AqocFYkFa", "_score" : 0.8014579, "_source" : { "name" : "中国工商银行北京分行" } },
      { "_index" : "blogs_ngram", "_type" : "_doc", "_id" : "v1mwHHkBgt5AqocFYkFa", "_score" : 0.8014579, "_source" : { "name" : "中国工商银行重庆分行" } },
      { "_index" : "blogs_ngram", "_type" : "_doc", "_id" : "wFmwHHkBgt5AqocFYkFa", "_score" : 0.8014579, "_source" : { "name" : "中国工商银行天津分行" } }
    ]
  }
}
Let's first use _analyze to understand the NGram tokenizer. The result makes it obvious: starting from the first character, it emits substrings of every length up to the configured maximum, then moves on to the second character, and so on until the whole string is tokenized. This also explains the search result above.
POST /blogs_ngram/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "中国工商银行重庆分行"
}

# result
{
  "tokens" : [
    { "token" : "中国", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 },
    { "token" : "中国工", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 1 },
    { "token" : "中国工商", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 2 },
    { "token" : "中国工商银", "start_offset" : 0, "end_offset" : 5, "type" : "word", "position" : 3 },
    { "token" : "中国工商银行", "start_offset" : 0, "end_offset" : 6, "type" : "word", "position" : 4 },
    { "token" : "中国工商银行重", "start_offset" : 0, "end_offset" : 7, "type" : "word", "position" : 5 },
    { "token" : "中国工商银行重庆", "start_offset" : 0, "end_offset" : 8, "type" : "word", "position" : 6 },
    { "token" : "中国工商银行重庆分", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 7 },
    { "token" : "中国工商银行重庆分行", "start_offset" : 0, "end_offset" : 10, "type" : "word", "position" : 8 },
    { "token" : "国工", "start_offset" : 1, "end_offset" : 3, "type" : "word", "position" : 9 },
    { "token" : "国工商", "start_offset" : 1, "end_offset" : 4, "type" : "word", "position" : 10 },
    { "token" : "国工商银", "start_offset" : 1, "end_offset" : 5, "type" : "word", "position" : 11 },
    { "token" : "国工商银行", "start_offset" : 1, "end_offset" : 6, "type" : "word", "position" : 12 },
    { "token" : "国工商银行重", "start_offset" : 1, "end_offset" : 7, "type" : "word", "position" : 13 },
    { "token" : "国工商银行重庆", "start_offset" : 1, "end_offset" : 8, "type" : "word", "position" : 14 },
    { "token" : "国工商银行重庆分", "start_offset" : 1, "end_offset" : 9, "type" : "word", "position" : 15 },
    { "token" : "国工商银行重庆分行", "start_offset" : 1, "end_offset" : 10, "type" : "word", "position" : 16 },
    { "token" : "工商", "start_offset" : 2, "end_offset" : 4, "type" : "word", "position" : 17 },
    { "token" : "工商银", "start_offset" : 2, "end_offset" : 5, "type" : "word", "position" : 18 },
    { "token" : "工商银行", "start_offset" : 2, "end_offset" : 6, "type" : "word", "position" : 19 },
    { "token" : "工商银行重", "start_offset" : 2, "end_offset" : 7, "type" : "word", "position" : 20 },
    { "token" : "工商银行重庆", "start_offset" : 2, "end_offset" : 8, "type" : "word", "position" : 21 },
    { "token" : "工商银行重庆分", "start_offset" : 2, "end_offset" : 9, "type" : "word", "position" : 22 },
    { "token" : "工商银行重庆分行", "start_offset" : 2, "end_offset" : 10, "type" : "word", "position" : 23 },
    { "token" : "商银", "start_offset" : 3, "end_offset" : 5, "type" : "word", "position" : 24 },
    { "token" : "商银行", "start_offset" : 3, "end_offset" : 6, "type" : "word", "position" : 25 },
    { "token" : "商银行重", "start_offset" : 3, "end_offset" : 7, "type" : "word", "position" : 26 },
    { "token" : "商银行重庆", "start_offset" : 3, "end_offset" : 8, "type" : "word", "position" : 27 },
    { "token" : "商银行重庆分", "start_offset" : 3, "end_offset" : 9, "type" : "word", "position" : 28 },
    { "token" : "商银行重庆分行", "start_offset" : 3, "end_offset" : 10, "type" : "word", "position" : 29 },
    { "token" : "银行", "start_offset" : 4, "end_offset" : 6, "type" : "word", "position" : 30 },
    { "token" : "银行重", "start_offset" : 4, "end_offset" : 7, "type" : "word", "position" : 31 },
    { "token" : "银行重庆", "start_offset" : 4, "end_offset" : 8, "type" : "word", "position" : 32 },
    { "token" : "银行重庆分", "start_offset" : 4, "end_offset" : 9, "type" : "word", "position" : 33 },
    { "token" : "银行重庆分行", "start_offset" : 4, "end_offset" : 10, "type" : "word", "position" : 34 },
    { "token" : "行重", "start_offset" : 5, "end_offset" : 7, "type" : "word", "position" : 35 },
    { "token" : "行重庆", "start_offset" : 5, "end_offset" : 8, "type" : "word", "position" : 36 },
    { "token" : "行重庆分", "start_offset" : 5, "end_offset" : 9, "type" : "word", "position" : 37 },
    { "token" : "行重庆分行", "start_offset" : 5, "end_offset" : 10, "type" : "word", "position" : 38 },
    { "token" : "重庆", "start_offset" : 6, "end_offset" : 8, "type" : "word", "position" : 39 },
    { "token" : "重庆分", "start_offset" : 6, "end_offset" : 9, "type" : "word", "position" : 40 },
    { "token" : "重庆分行", "start_offset" : 6, "end_offset" : 10, "type" : "word", "position" : 41 },
    { "token" : "庆分", "start_offset" : 7, "end_offset" : 9, "type" : "word", "position" : 42 },
    { "token" : "庆分行", "start_offset" : 7, "end_offset" : 10, "type" : "word", "position" : 43 },
    { "token" : "分行", "start_offset" : 8, "end_offset" : 10, "type" : "word", "position" : 44 }
  ]
}
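The principle shown by _analyze can be reproduced in a few lines of Python. This is only a rough sketch of what the ngram tokenizer does for min_gram=2, max_gram=20 (not the actual ES implementation, which also applies the token_chars filter):

```python
def ngrams(text, min_gram=2, max_gram=20):
    """Emit every substring of length min_gram..max_gram,
    sliding the start position one character at a time."""
    tokens = []
    for start in range(len(text)):
        for end in range(start + min_gram, min(start + max_gram, len(text)) + 1):
            tokens.append(text[start:end])
    return tokens

tokens = ngrams("中国工商银行重庆分行")
print(len(tokens))   # 45, matching positions 0-44 in the _analyze result
print(tokens[:3])    # ['中国', '中国工', '中国工商']
```

A 10-character string already yields 45 terms, which makes concrete why open issue (2) above warns about index size: term count grows roughly quadratically with field length until max_gram caps it.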
Now look at the term suggest results. For "马央九" and "手川" on their own the suggestions are not ideal: no options are returned at all. But for "使用手川" and "台湾马央九", options based on the ngram terms are returned.
POST /blogs_ngram/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "使用手册",
      "term": {
        "field": "name",
        "suggest_mode": "always"
      }
    }
  }
}

# result
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 0, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "my-suggestion" : [
      { "text" : "使用", "offset" : 0, "length" : 2, "options" : [ ] },
      { "text" : "使用手", "offset" : 0, "length" : 3, "options" : [ ] },
      { "text" : "使用手册", "offset" : 0, "length" : 4, "options" : [
          { "text" : "使用手册范", "score" : 0.75, "freq" : 1 },
          { "text" : "使用手川", "score" : 0.75, "freq" : 1 },
          { "text" : "使用手", "score" : 0.6666666, "freq" : 2 },
          { "text" : "使用手册范文", "score" : 0.5, "freq" : 1 }
        ]
      },
      { "text" : "用手", "offset" : 1, "length" : 2, "options" : [ ] },
      { "text" : "用手册", "offset" : 1, "length" : 3, "options" : [ ] },
      { "text" : "手册", "offset" : 2, "length" : 2, "options" : [ ] }
    ]
  }
}
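The term suggester ranks candidate terms by string similarity to the input text (by default the `internal` string_distance, a Damerau-Levenshtein variant). The exact ES scoring is internal, but a plain Levenshtein edit distance conveys the intuition: the top options above are each one edit away from "使用手册", while the lowest-scored "使用手册范文" is two edits away. A minimal sketch:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance
    (insertions, deletions, substitutions, each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

for cand in ["使用手册范", "使用手川", "使用手", "使用手册范文"]:
    print(cand, levenshtein("使用手册", cand))
```

Note that the suggester compares against indexed ngram terms, not documents, which is why long ngram fragments such as "使用手册范" show up as options.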
At this point we can query with the options returned above plus the original term, which gives us the widest possible set of exact-match results.
POST /blogs_ngram/_search
{
  "query" : {
    "terms": {
      "name": [ "使用手川", "使用手", "使用手册", "使用手册范" ]
    }
  }
}

# result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 2, "relation" : "eq" },
    "max_score" : 1.0,
    "hits" : [
      { "_index" : "blogs_ngram", "_type" : "_doc", "_id" : "w1mwHHkBgt5AqocFYkFa", "_score" : 1.0, "_source" : { "name" : "你我她公司使用手册范文" } },
      { "_index" : "blogs_ngram", "_type" : "_doc", "_id" : "xFmwHHkBgt5AqocFYkFa", "_score" : 1.0, "_source" : { "name" : "测试错别字使用手川" } }
    ]
  }
}
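Putting the two steps together means extracting the option texts from the suggest response and combining them with the original keyword into a terms query body. A small Python sketch of that glue code (`collect_terms` and `build_terms_query` are hypothetical helpers, not ES client APIs; the response shape follows the suggest result above):

```python
def collect_terms(keyword, suggest_response, suggestion_name="my-suggestion"):
    # Gather the original keyword plus every suggested option text.
    terms = {keyword}
    for entry in suggest_response["suggest"][suggestion_name]:
        for option in entry["options"]:
            terms.add(option["text"])
    return sorted(terms)

def build_terms_query(field, terms):
    # Request body for POST /<index>/_search with a terms query.
    return {"query": {"terms": {field: terms}}}
```

In practice you would send the first request, feed its JSON response through `collect_terms`, and post the body from `build_terms_query` as the second request.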
Copyright notice: this is an original article by InfoQ author 小林-1025.
Original link: http://xie.infoq.cn/article/bb94949754e4c537ca322f747
This article is licensed under CC-BY 4.0; please keep the original link and this notice when republishing.