002 ES NGram Tokenization + Suggest
Continuing with the requirement from the previous post, this feature covers the following smart-correction + search needs.
Requirement:
When the user types "使用手川" (a misspelling), the search results must include both "使用手册" and "使用手川".
When the user types "马央九" (a misspelling), the search results must include both "马英九" and "马央九".
Solution:
First, tokenize the target field with an NGram tokenizer, then use the term suggester to obtain the suggestions Elasticsearch derives from the index.
Next, run a terms query with the original search keyword plus the options returned in the previous step, which yields the widest possible set of matches.
Remaining issues:
(1) The options returned for a single short term are not ideal.
(2) As _analyze shows, ngram tokenization produces a very large number of terms, so the index takes up considerably more space.
Workarounds:
(1) Since the business side had no need for exact single-term search, they preprocessed the keywords instead.
(2) The data volume is small and this runs as an offline task, so the index size is acceptable.
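Remaining issue (2) is easy to quantify. A quick sketch (ignoring the token_chars filtering) counts how many ngram terms an n-character field value produces under the min_gram=2 / max_gram=20 settings used below — the growth is roughly quadratic in the field length:

```python
def ngram_term_count(n, min_gram=2, max_gram=20):
    """Count the ngram terms emitted for an n-character value:
    for each start position, count the substring lengths that fit
    within [min_gram, max_gram] and the remaining characters."""
    return sum(max(0, min(n, s + max_gram) - (s + min_gram) + 1)
               for s in range(n))

print(ngram_term_count(10))  # → 45 terms for a 10-character name
```

For a 10-character value like "中国工商银行重庆分行", 45 terms go into the index, versus a handful with a dictionary-based analyzer; this is where the extra space goes.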
Summary:
(1) If the goal were simply to show a dropdown of suggestions as the user types, you could log users' search terms and implement it with a prefix-based completion suggester.
(2) Why not use ik_max_word and rely on analyzed matching at query time to get maximum coverage? Two reasons:
a. Looking at the analyzer first: ik_max_word splits "使用手册" into tokens such as "使用" and "用手". With hundreds of thousands of documents, queries on such tokens return a huge number of hits that are not the data the business actually wants, so we need exact term queries for the results to be usable.
b. Second, the keywords the business needs to search are not necessarily in the ik dictionary, and it is impractical to update the dict and rebuild the index every time.
Create the index
PUT /blogs_ngram
{
"settings": {
"index.max_ngram_diff": 20,
"analysis": {
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer"
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "ngram_analyzer"
}
}
}
}
Index the data
POST _bulk?refresh=true
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "中国工商银行北京分行"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "中国工商银行重庆分行"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "中国工商银行天津分行"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "中国招商银行北京分行"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "工行新疆分行"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "你我她公司使用手册范文"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "测试错别字使用手川"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "马英九受邀请访问日本"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "台湾马英九到访美国"}
{ "index" : { "_index" : "blogs_ngram" } }
{ "name": "马央九先生在吃水果"}
Query: searching for the keyword "工商" returns every document containing that keyword. Of course, the goal here is not search or fuzzy matching per se, but to understand more deeply how this tokenizer splits text.
POST /blogs_ngram/_search
{
"query": {
"term": {"name": "工商"}
}
}
# result
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.8014579,
"hits" : [
{
"_index" : "blogs_ngram",
"_type" : "_doc",
"_id" : "vlmwHHkBgt5AqocFYkFa",
"_score" : 0.8014579,
"_source" : {
"name" : "中国工商银行北京分行"
}
},
{
"_index" : "blogs_ngram",
"_type" : "_doc",
"_id" : "v1mwHHkBgt5AqocFYkFa",
"_score" : 0.8014579,
"_source" : {
"name" : "中国工商银行重庆分行"
}
},
{
"_index" : "blogs_ngram",
"_type" : "_doc",
"_id" : "wFmwHHkBgt5AqocFYkFa",
"_score" : 0.8014579,
"_source" : {
"name" : "中国工商银行天津分行"
}
}
]
}
}
Let's first run _analyze to understand the NGram tokenizer. The result speaks for itself: starting at the first character, it emits substrings of every length within the configured range, then repeats from the second character, and so on until the whole string has been tokenized. This also explains the search results above.
POST /blogs_ngram/_analyze
{
"analyzer": "ngram_analyzer",
"text": "中国工商银行重庆分行"
}
# result
{
"tokens" : [
{
"token" : "中国",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "中国工",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "中国工商",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 2
},
{
"token" : "中国工商银",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 3
},
{
"token" : "中国工商银行",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 4
},
{
"token" : "中国工商银行重",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 5
},
{
"token" : "中国工商银行重庆",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 6
},
{
"token" : "中国工商银行重庆分",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 7
},
{
"token" : "中国工商银行重庆分行",
"start_offset" : 0,
"end_offset" : 10,
"type" : "word",
"position" : 8
},
{
"token" : "国工",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 9
},
{
"token" : "国工商",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 10
},
{
"token" : "国工商银",
"start_offset" : 1,
"end_offset" : 5,
"type" : "word",
"position" : 11
},
{
"token" : "国工商银行",
"start_offset" : 1,
"end_offset" : 6,
"type" : "word",
"position" : 12
},
{
"token" : "国工商银行重",
"start_offset" : 1,
"end_offset" : 7,
"type" : "word",
"position" : 13
},
{
"token" : "国工商银行重庆",
"start_offset" : 1,
"end_offset" : 8,
"type" : "word",
"position" : 14
},
{
"token" : "国工商银行重庆分",
"start_offset" : 1,
"end_offset" : 9,
"type" : "word",
"position" : 15
},
{
"token" : "国工商银行重庆分行",
"start_offset" : 1,
"end_offset" : 10,
"type" : "word",
"position" : 16
},
{
"token" : "工商",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 17
},
{
"token" : "工商银",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 18
},
{
"token" : "工商银行",
"start_offset" : 2,
"end_offset" : 6,
"type" : "word",
"position" : 19
},
{
"token" : "工商银行重",
"start_offset" : 2,
"end_offset" : 7,
"type" : "word",
"position" : 20
},
{
"token" : "工商银行重庆",
"start_offset" : 2,
"end_offset" : 8,
"type" : "word",
"position" : 21
},
{
"token" : "工商银行重庆分",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 22
},
{
"token" : "工商银行重庆分行",
"start_offset" : 2,
"end_offset" : 10,
"type" : "word",
"position" : 23
},
{
"token" : "商银",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 24
},
{
"token" : "商银行",
"start_offset" : 3,
"end_offset" : 6,
"type" : "word",
"position" : 25
},
{
"token" : "商银行重",
"start_offset" : 3,
"end_offset" : 7,
"type" : "word",
"position" : 26
},
{
"token" : "商银行重庆",
"start_offset" : 3,
"end_offset" : 8,
"type" : "word",
"position" : 27
},
{
"token" : "商银行重庆分",
"start_offset" : 3,
"end_offset" : 9,
"type" : "word",
"position" : 28
},
{
"token" : "商银行重庆分行",
"start_offset" : 3,
"end_offset" : 10,
"type" : "word",
"position" : 29
},
{
"token" : "银行",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 30
},
{
"token" : "银行重",
"start_offset" : 4,
"end_offset" : 7,
"type" : "word",
"position" : 31
},
{
"token" : "银行重庆",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 32
},
{
"token" : "银行重庆分",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 33
},
{
"token" : "银行重庆分行",
"start_offset" : 4,
"end_offset" : 10,
"type" : "word",
"position" : 34
},
{
"token" : "行重",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 35
},
{
"token" : "行重庆",
"start_offset" : 5,
"end_offset" : 8,
"type" : "word",
"position" : 36
},
{
"token" : "行重庆分",
"start_offset" : 5,
"end_offset" : 9,
"type" : "word",
"position" : 37
},
{
"token" : "行重庆分行",
"start_offset" : 5,
"end_offset" : 10,
"type" : "word",
"position" : 38
},
{
"token" : "重庆",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 39
},
{
"token" : "重庆分",
"start_offset" : 6,
"end_offset" : 9,
"type" : "word",
"position" : 40
},
{
"token" : "重庆分行",
"start_offset" : 6,
"end_offset" : 10,
"type" : "word",
"position" : 41
},
{
"token" : "庆分",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 42
},
{
"token" : "庆分行",
"start_offset" : 7,
"end_offset" : 10,
"type" : "word",
"position" : 43
},
{
"token" : "分行",
"start_offset" : 8,
"end_offset" : 10,
"type" : "word",
"position" : 44
}
]
}
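The tokenization above can be sketched in plain Python. This is a hypothetical stand-in for the ngram tokenizer with min_gram=2 and max_gram=20 (the real tokenizer additionally filters by token_chars):

```python
def ngram_tokens(text, min_gram=2, max_gram=20):
    """Emit every substring whose length falls within [min_gram, max_gram],
    sliding the start position one character at a time — mirroring how the
    ngram tokenizer walks the input."""
    tokens = []
    for start in range(len(text)):
        for end in range(start + min_gram, min(start + max_gram, len(text)) + 1):
            tokens.append(text[start:end])
    return tokens

print(ngram_tokens("中国工商银行重庆分行")[:3])
# → ['中国', '中国工', '中国工商']
print(len(ngram_tokens("中国工商银行重庆分行")))
# → 45, matching the 45 tokens in the _analyze output above
```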
Now look at the term suggester results. For the single terms "马央九" and "手川" the suggestions are not ideal: no options are returned at all. For "使用手川" and "台湾马央九", however, the suggester does return options based on the ngram tokens.
POST /blogs_ngram/_search
{
"suggest": {
"my-suggestion": {
"text": "使用手册",
"term": {
"field": "name",
"suggest_mode": "always"
}
}
}
}
# result
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"my-suggestion" : [
{
"text" : "使用",
"offset" : 0,
"length" : 2,
"options" : [ ]
},
{
"text" : "使用手",
"offset" : 0,
"length" : 3,
"options" : [ ]
},
{
"text" : "使用手册",
"offset" : 0,
"length" : 4,
"options" : [
{
"text" : "使用手册范",
"score" : 0.75,
"freq" : 1
},
{
"text" : "使用手川",
"score" : 0.75,
"freq" : 1
},
{
"text" : "使用手",
"score" : 0.6666666,
"freq" : 2
},
{
"text" : "使用手册范文",
"score" : 0.5,
"freq" : 1
}
]
},
{
"text" : "用手",
"offset" : 1,
"length" : 2,
"options" : [ ]
},
{
"text" : "用手册",
"offset" : 1,
"length" : 3,
"options" : [ ]
},
{
"text" : "手册",
"offset" : 2,
"length" : 2,
"options" : [ ]
}
]
}
}
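Pulling the option texts out of a response like the one above is straightforward. A minimal sketch (the `resp` dict here is a trimmed-down stand-in for the real JSON response):

```python
def collect_suggestions(response, suggester="my-suggestion"):
    """Gather every suggested term across all suggest entries,
    preserving order and dropping duplicates."""
    seen = []
    for entry in response["suggest"][suggester]:
        for option in entry["options"]:
            if option["text"] not in seen:
                seen.append(option["text"])
    return seen

# Trimmed example mirroring the response shown above
resp = {"suggest": {"my-suggestion": [
    {"text": "使用", "options": []},
    {"text": "使用手册", "options": [
        {"text": "使用手册范", "score": 0.75, "freq": 1},
        {"text": "使用手川", "score": 0.75, "freq": 1},
        {"text": "使用手", "score": 0.6666666, "freq": 2},
        {"text": "使用手册范文", "score": 0.5, "freq": 1},
    ]},
]}}
print(collect_suggestions(resp))
# → ['使用手册范', '使用手川', '使用手', '使用手册范文']
```

Appending the original keyword to this list gives exactly the term set used in the final query below.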
At this point, we can query with the original keyword plus the options returned above, giving us the widest possible range of results.
POST /blogs_ngram/_search
{
"query" : {
"terms": {
"name": [
"使用手川",
"使用手",
"使用手册",
"使用手册范"
]
}
}
}
# result
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "blogs_ngram",
"_type" : "_doc",
"_id" : "w1mwHHkBgt5AqocFYkFa",
"_score" : 1.0,
"_source" : {
"name" : "你我她公司使用手册范文"
}
},
{
"_index" : "blogs_ngram",
"_type" : "_doc",
"_id" : "xFmwHHkBgt5AqocFYkFa",
"_score" : 1.0,
"_source" : {
"name" : "测试错别字使用手川"
}
}
]
}
}
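The final terms query ORs exact matches over the ngram-generated terms. A small in-memory simulation — reusing the same substring-enumeration idea as the earlier sketch, with the same min_gram/max_gram assumptions — shows why exactly these two documents match:

```python
def ngram_tokens(text, min_gram=2, max_gram=20):
    """Set of ngram terms a field value contributes to the index."""
    return {text[s:e]
            for s in range(len(text))
            for e in range(s + min_gram, min(s + max_gram, len(text)) + 1)}

docs = [
    "你我她公司使用手册范文",
    "测试错别字使用手川",
    "中国工商银行北京分行",
]
query_terms = ["使用手川", "使用手", "使用手册", "使用手册范"]

# A document matches when any query term equals one of its indexed ngram terms
hits = [d for d in docs if ngram_tokens(d) & set(query_terms)]
print(hits)
# → ['你我她公司使用手册范文', '测试错别字使用手川']
```

Both the correctly spelled document and the misspelled one are returned, which is exactly the requirement stated at the top.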
Copyright notice: this is an original article by InfoQ author 小林-1025.
Original link: http://xie.infoq.cn/article/bb94949754e4c537ca322f747
Licensed under CC-BY 4.0; please keep the original link and this notice when republishing.