写点什么

Lucene 环境下基于 NGram 分词的中文高级检索

作者:alexgaoyh
  • 2024-04-27
    河南
  • 本文字数:3754 字

    阅读完需:约 12 分钟

背景

  本文是在中文环境下,基于 Lucene 搜索引擎的 NGram 分词器,实现的高级检索功能,配合多种不同形式下的搜索条件进行检索。


  当前高级检索的实现思路为:按照搜索条件的顺序,一个条件一个条件的添加搜索条件。比如: (A must B should C must D)代表 ((A 与 B)或 C)与 D)。

测试样本

[  {"title":  "番茄原产南美洲,学名番茄。。", "remark":  "用来进行额外的说明备注字段", "seqNo":  123456},  {"title":  "番茄,即西红柿,石头里面。", "remark":  "https://pap-docs.pap.net.cn/ 开发测试的说明备注字段", "seqNo":  123457},  {"title":  "番石榴", "remark":  "搜索一下:https://gitee.com/alexgaoyh", "seqNo":  123458}]
复制代码

搜索示例

  接下来列举多种搜索条件,并且返回搜索条件下匹配的搜索结果和针对 title 字段的搜索结果高亮展示。



    // relation-关系(与must / 或shoulud), field-搜索字段, keyword-搜索关键词, match(精确term /模糊fuzzy), boost 权重    // 在页面交互的过程中,第一行的搜索条件是没有关系的,这里以第二行搜索条件的关系为准。    public HighSearchDTO(String relation, String field, String keyword, String match, float boost) {        this.relation = relation;        this.field = field;        this.keyword = keyword;        this.match = match;        this.boost = boost;    }
// 此条件说明 必须包含如下两个条件: tilte模糊搜索‘榴’ remark精确搜索‘ao’ // 搜索结果 // 命中一条记录: // 1、番石<font color='red'>榴</font> highSearchDTOs.add(new HighSearchDTO("", "title", "榴", "fuzzy", 4.0f)); highSearchDTOs.add(new HighSearchDTO("must", "remark", "ao", "term", 4.0f));

// 此条件说明 或包含如下两个条件: tilte模糊搜索‘榴’ remark精确搜索‘ao’ // 搜索结果 // 命中两条记录: // 1、番石<font color='red'>榴</font> // 2、番茄,即西红柿,石头里面。 highSearchDTOs.add(new HighSearchDTO("", "title", "榴", "fuzzy", 4.0f)); highSearchDTOs.add(new HighSearchDTO("should", "remark", "ao", "term", 4.0f));

// 此条件说明 精确包含如下两个条件 (模糊包含如下两个条件: tilte模糊搜索‘榴’ remark精确搜索‘ao’ ) title精确搜索‘石’ // 搜索结果 // 命中两条记录: // 1、番<font color='red'>石</font><font color='red'>榴</font> // 2、番茄,即西红柿,<font color='red'>石</font>头里面。 highSearchDTOs.add(new HighSearchDTO("", "title", "榴", "fuzzy", 4.0f)); highSearchDTOs.add(new HighSearchDTO("should", "remark", "ao", "term", 4.0f)); highSearchDTOs.add(new HighSearchDTO("must", "title", "石", "term", 4.0f));

// 此条件说明 精确包含如下两个条件 (精确包含如下两个条件: tilte模糊搜索‘榴’ remark精确搜索‘ao’ ) title精确搜索‘石’ // 搜索结果 // 命中一条记录: // 1、番<font color='red'>石</font><font color='red'>榴</font> highSearchDTOs.add(new HighSearchDTO("", "title", "榴", "fuzzy", 4.0f)); highSearchDTOs.add(new HighSearchDTO("must", "remark", "ao", "term", 4.0f)); highSearchDTOs.add(new HighSearchDTO("must", "title", "石", "term", 4.0f));

// 此条件说明 或包含如下两个条件 (或包含如下两个条件: tilte模糊搜索‘榴’ remark精确搜索‘ao’ ) title精确搜索‘石’ // 搜索结果 // 命中两条记录: // 1、番石<font color='red'>榴</font> // 2、番茄,即<font color='red'>西</font>红柿,石头里面。 highSearchDTOs.add(new HighSearchDTO("", "title", "榴", "fuzzy", 4.0f)); highSearchDTOs.add(new HighSearchDTO("should", "remark", "ao", "term", 4.0f)); highSearchDTOs.add(new HighSearchDTO("should", "title", "西", "term", 4.0f));
复制代码

实现

  列出如何将 HighSearchDTO 对象集合转换为 Lucen 的 Query 搜索对象.


    public static BooleanQuery.Builder convert(List<HighSearchDTO> paramsList, Analyzer analyzer) throws Exception {
BooleanQuery.Builder bq = new BooleanQuery.Builder();
if(paramsList.size() > 1) { for(int idx = 0; idx < paramsList.size() - 1; idx++) { if(idx == 0) { bq = merge(paramsList.get(idx), paramsList.get(idx + 1), analyzer); } else { bq = merge(bq, paramsList.get(idx + 1), analyzer); } } } else { HighSearchDTO left = paramsList.get(0); if(left.getMatch().equals("term")) { QueryParser queryParser = new QueryParser(left.getField(), analyzer); queryParser.setDefaultOperator(QueryParser.Operator.AND); Query queryLeft = queryParser.parse(left.getKeyword()); bq.add(queryLeft, BooleanClause.Occur.MUST); } else if(left.getMatch().equals("fuzzy")) { QueryParser queryParser = new QueryParser(left.getField(), analyzer); Query queryLeft = queryParser.parse(left.getKeyword()); bq.add(queryLeft, BooleanClause.Occur.MUST); } }
return bq; }
public static BooleanQuery.Builder merge(HighSearchDTO left, HighSearchDTO right, Analyzer analyzer) throws Exception { BooleanQuery.Builder bq = new BooleanQuery.Builder();
Query queryLeft = null; if(left.getMatch().equals("term")) { QueryParser queryParser = new QueryParser(left.getField(), analyzer); queryParser.setDefaultOperator(QueryParser.Operator.AND); queryLeft = queryParser.parse(left.getKeyword()); } else if(left.getMatch().equals("fuzzy")) { QueryParser queryParser = new QueryParser(left.getField(), analyzer); queryLeft = queryParser.parse(left.getKeyword()); }
Query queryRight = null; if(right.getMatch().equals("term")) { QueryParser queryParser = new QueryParser(right.getField(), analyzer); queryParser.setDefaultOperator(QueryParser.Operator.AND); queryRight = queryParser.parse(right.getKeyword()); } else if(right.getMatch().equals("fuzzy")) { QueryParser queryParser = new QueryParser(right.getField(), analyzer); queryRight = queryParser.parse(right.getKeyword()); }
if(right.getRelation().equals("must")) { bq.add(queryLeft, BooleanClause.Occur.MUST); bq.add(queryRight, BooleanClause.Occur.MUST); } else { bq.add(queryLeft, BooleanClause.Occur.SHOULD); bq.add(queryRight, BooleanClause.Occur.SHOULD); bq.setMinimumNumberShouldMatch(1); }
return bq; }
public static BooleanQuery.Builder merge(BooleanQuery.Builder leftQuery, HighSearchDTO right, Analyzer analyzer) throws Exception {
BooleanQuery.Builder bq = new BooleanQuery.Builder();
Query queryRight = null; if(right.getMatch().equals("term")) { QueryParser queryParser = new QueryParser(right.getField(), analyzer); queryParser.setDefaultOperator(QueryParser.Operator.AND); queryRight = queryParser.parse(right.getKeyword()); } else if(right.getMatch().equals("fuzzy")) { QueryParser queryParser = new QueryParser(right.getField(), analyzer); queryRight = queryParser.parse(right.getKeyword()); }
if(right.getRelation().equals("must")) { bq.add(leftQuery.build(), BooleanClause.Occur.MUST); bq.add(queryRight, BooleanClause.Occur.MUST); } else { bq.add(leftQuery.build(), BooleanClause.Occur.SHOULD); bq.add(queryRight, BooleanClause.Occur.SHOULD); bq.setMinimumNumberShouldMatch(1); }
return bq; }
复制代码

参考

  1. http://pap-docs.pap.net.cn/

  2. https://gitee.com/alexgaoyh


发布于: 刚刚阅读数: 5
用户头像

alexgaoyh

关注

DevOps 2013-12-08 加入

https://gitee.com/alexgaoyh

评论

发布
暂无评论
Lucene环境下基于NGram分词的中文高级检索_中文分词_alexgaoyh_InfoQ写作社区