释放搜索潜力：基于 ES(ElasticSearch) 打造高效的语义搜索系统，让信息尽在掌握

2023-10-27
浙江
本文字数：6304 字
阅读完需：约 21 分钟

释放搜索潜力：基于 ES(ElasticSearch)打造高效的语义搜索系统，让信息尽在掌握[1.安装部署篇--简洁版]，支持 Linux/Windows 部署安装

效果展示

PaddleNLP Pipelines 是一个端到端智能文本产线框架，面向 NLP 全场景为用户提供低门槛构建强大产品级系统的能力。本项目将通过一种简单高效的方式搭建一套语义检索系统，使用自然语言文本通过语义进行智能文档查询，而不是关键字匹配。

基于ES(ElasticSearch)打造高效的语义搜索系统效果展示链接

点击链接进行跳转：

释放搜索潜力：基于ES(ElasticSearch)打造高效的语义搜索系统，让信息尽在掌握[1.安装部署篇---完整版]，支持Linux/Windows部署安装

释放搜索潜力：基于ES(ElasticSearch)打造高效的语义搜索系统，让信息尽在掌握[2.项目讲解篇],支持Linux/Windows部署安装

A1.Windows 下搭建语义检索系统

conda activate temp_ese:cd /temp_ES/PaddleNLP-develop/pipelines

腾讯镜像：-i https://mirrors.cloud.tencent.com/pypi/simple

pip list 版本：

paddle-pipelines 0.6.0paddlenlp 2.6.0paddlepaddle 2.5.1streamlit 1.11.1

pip install streamlit==1.11.1 -i https://mirrors.cloud.tencent.com/pypi/simplepip install altair==4.2.2 -i https://mirrors.cloud.tencent.com/pypi/simple

A1.1 运行环境安装

git clone https://github.com/tvst/htbuilder.gitcd htbuilder/python setup.py  install

复制代码

A1.2 paddlenlp 安装（包含了 paddlenlp）

pip install paddlenlp==2.6.0 -i https://mirrors.cloud.tencent.com/pypi/simple 
#pip install --upgrade paddle-pipelines -i https://pypi.tuna.tsinghua.edu.cn/simple

#或者源码进行安装最新版本cd ${HOME}/PaddleNLP/pipelines/pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simplepython setup.py install

复制代码

A1.3 下载 pipelines 源代码：github 下载 or 手动下载

git clone https://github.com/PaddlePaddle/PaddleNLP.gitcd PaddleNLP/pipelines

复制代码

A1.4 运行案例查看效果

* 我们建议在 GPU 环境下运行本示例，运行速度较快

复制代码

python examples/semantic-search/semantic_search_example.py --device gpu

* 如果只有 CPU 机器，安装CPU版本的Paddle后，可以通过 --device 参数指定 cpu 即可, 运行耗时较长

复制代码

python examples/semantic-search/semantic_search_example.py --device cpu

模型相关修改见3.3

A2.ES 相关配置

A2.1 版本安装 ES 版本提前官网下载好即可，放在对应路径，进入虚拟环境

官网：https://www.elastic.co/cn/downloads/elasticsearch

https://blog.csdn.net/sinat_39620217/article/details/133984629

A2.2 可视化工具 Kibana

elasticsearch 可视化工具 Kibana：为了更好的对数据进行管理，可以使用 Kibana 可视化工具进行管理和分析，下载链接为 Kibana，下载完后解压，直接双击运行 bin\kibana.bat 即可。

链接：http://localhost:5601/app/home

A2.3 ES 修改：config

需要编辑 config/elasticsearch.yml，在末尾添加：elasticsearch.yml 把 xpack.security.enabled 设置成 false，xpack.security.enabled: false
然后直接双击 bin(右击管理员)目录下的 elasticsearch.bat 即可启动(elasticsearch-8.3.3\bin\elasticsearch.bat)。
Elastic search 日志显示错误 exception during geoip databases update

    ingest.geoip.downloader.enabled: false

复制代码

#查看es是否成功启动curl http://localhost:9200/_aliases?pretty=true

复制代码

A2.4 文档数据写入 ann 索引库(重点)

官网直接给这条语句，但会报错的，需要修改一下参数。

python utils/offline_ann.py --index_name dureader_robust_query_encoder

复制代码

可行命令：

python utils/offline_ann.py --index_name dureader_robust_query_encoder --doc_dir data/dureader_dev --search_engine elastic --embed_title True --delete_index --device cpu --query_embedding_model rocketqa-zh-nano-query-encoder --passage_embedding_model rocketqa-zh-nano-para-encoder --embedding_dim 312

复制代码

关注三个参数
query_embedding_model rocketqa-zh-nano-query-encoder
passage_embedding_model rocketqa-zh-nano-para-encoder
embedding_dim 312 这里都使用 nano 版本模型，向量维度 312

(尝试过可以换成 base 模型，768 维度，需要注意的是：启动 RestAPI 模型服务的时候，这三个参数一定要跟这里一致，否则报错，或者检索无效)

查看 es 中是否已经是有数据:

curl http://localhost:9200/dureader_robust_query_encoder/_search

复制代码

如果需要重新写入数据，则需要先删除索引:

curl -XDELETE http://localhost:9200/dureader_robust_query_encoder

复制代码

基于 Kibana 查看

A3.启动 Rest API 模型服务

这里要用要用 anaconda powershell，不能用 Anaconda prompt ！！！

这里要用 anaconda powershell ！！！

#指定语义检索系统的Yaml配置文件,Linux/macosexport PIPELINE_YAML_PATH=rest_api/pipeline/semantic_search.yaml#指定语义检索系统的Yaml配置文件,Windows powershell$env:PIPELINE_YAML_PATH='rest_api/pipeline/semantic_search.yaml'# 使用端口号 8891 启动模型服务python rest_api/application.py 8891

复制代码

#主要关注这三个参数:#embedding_dim: 312#query_embedding_model: rocketqa-zh-nano-query-encoder#passage_embedding_model: rocketqa-zh-nano-para-encoder#后面Ranker的model_name_or_path不用跟这里一致

复制代码

成功显示：端口链接显示

A4.启动 WebUI

streamlit 安装

pip install streamlit==1.11.1 -i https://mirrors.cloud.tencent.com/pypi/simple

复制代码


#anaconda powershell#配置模型服务地址$env:API_ENDPOINT='http://127.0.0.1:8891'#在指定端口 8502 启动 WebUIpython -m streamlit run ui/webapp_semantic_search.py --server.port 8502

复制代码

本地打开这个网页可以使用语义检索系统了：http://127.0.0.1:8502

http://localhost:8502/

A5. 数据更新

数据更新的方法有两种，第一种使用前面的 utils/offline_ann.py 进行数据更新，另一种是使用前端界面的文件上传进行数据更新，支持 txt，pdf，image，word 的格式，以 txt 格式的文件为例，每段文本需要使用空行隔开，程序会根据空行进行分段建立索引，示例数据如下(demo.txt)：

兴证策略认为，最恐慌的时候已经过去，未来一个月市场迎来阶段性修复窗口。
从海外市场表现看，对俄乌冲突的恐慌情绪已显著释放，海外权益市场也从单边下跌转入双向波动。
长期，继续聚焦科技创新的五大方向。1)新能源(新能源汽车、光伏、风电、特高压等)，2)新一代信息通信技术(人工智能、大数据、云计算、5G等)，3)高端制造(智能数控机床、机器人、先进轨交装备等)，4)生物医药(创新药、CXO、医疗器械和诊断设备等)，5)军工(导弹设备、军工电子元器件、空间站、航天飞机等)。

复制代码

B.linux 下搭建语义检索系统

B.1 GPU 版本

提示：Centos 系统下坑比较多，需要使用 paddle 2.4.2 Ubuntu 推荐使用 2.5.1 or develop。

1.1 安装依赖

conda create -n paddlenlp_gpu  python=3.8conda activate paddlenlp_gpupython -m pip install --upgrade pip

复制代码

PaddleGPU、CUDA cudnn 安装见：https://blog.csdn.net/sinat_39620217/article/details/131675175

当前版本：cuda11.2、paddle-develop 版本（2.5.1 存在 bug 解决方案见上述链接，可以使用 2.5.2 版本）

ImportError: libssl.so.1.1: cannot open shared object file: No such file or directory 等问题

版本查看：

pip install paddlepaddle-gpu==
(from versions: 1.8.5.post97, 1.8.5.post107, 2.0.0rc0, 2.0.0rc1, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0rc0, 2.2.0, 2.2.1, 2.2.2, 2.3.0rc0, 2.3.0, 2.3.1, 2.3.2, 2.4.0rc0, 2.4.0, 2.4.1, 2.4.2, 2.5.0rc0, 2.5.0rc1, 2.5.0, 2.5.1, 2.5.2)
pip install paddlenlp==
 (from versions: 2.0.0a0, 2.0.0a1, 2.0.0a2, 2.0.0a3, 2.0.0a4, 2.0.0a5, 2.0.0a6, 2.0.0a7, 2.0.0a8, 2.0.0a9, 2.0.0b0, 2.0.0b1, 2.0.0b2, 2.0.0b3, 2.0.0b4, 2.0.0rc0, 2.0.0rc1, 2.0.0rc2, 2.0.0rc3, 2.0.0rc4, 2.0.0rc5, 2.0.0rc6, 2.0.0rc7, 2.0.0rc8, 2.0.0rc9, 2.0.0rc10, 2.0.0rc11, 2.0.0rc12, 2.0.0rc13, 2.0.0rc14, 2.0.0rc15, 2.0.0rc16, 2.0.0rc17, 2.0.0rc18, 2.0.0rc19, 2.0.0rc20, 2.0.0rc21, 2.0.0rc22, 2.0.0rc23, 2.0.0rc24, 2.0.0rc25, 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.0.4, 2.0.5, 2.0.6, 2.0.7, 2.0.8, 2.1.0, 2.1.1, 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.2.4, 2.2.5, 2.2.6, 2.3.0rc0, 2.3.0rc1, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.3.5, 2.3.7, 2.4.0, 2.4.1.dev0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 2.4.8, 2.4.9, 2.5.0, 2.5.1, 2.5.2, 2.6.0rc0, 2.6.0, 2.6.1)

复制代码

python -m pip install paddlepaddle-gpu==0.0.0.post112 -f https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html
python -m pip install paddlepaddle-gpu==2.5.2 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
pip install --upgrade paddlenlp -i https://mirrors.cloud.tencent.com/pypi/simple#看报错修改指令pip install --use-pep517 --upgrade paddlenlp -i https://mirrors.cloud.tencent.com/pypi/simple --trusted-host mirrors.aliyun.com
pip install --upgrade paddle-pipelines -i https://pypi.tuna.tsinghua.edu.cn/simple

复制代码

pip install onnxruntime-gpu onnx onnxconverter-common -i https://mirrors.cloud.tencent.com/pypi/simple

复制代码

1.2 测试效果

conda activate paddlenlp_2.6.0cd /algorithm/temp_es/PaddleNLP-develop/pipelinescd /algorithm/temp_es/elasticsearch-8.3.3#到pipelines路径下
python examples/semantic-search/semantic_search_example.py --device gpu --search_engine faiss
python examples/semantic-search/semantic_search_example.py --device gpu --query_embedding_model rocketqa-zh-nano-query-encoder --params_path checkpoints/model_40/model_state.pdparams --embedding_dim 256

复制代码

1.3 执行 ES

创建新用户使用：创建一个新的用户，例如"elasticsearch":

sudo useradd elasticsearch1#将Elasticsearch的安装目录的所有权更改为"elasticsearch":
sudo chown -R elasticsearch1:elasticsearch1 /algorithm/temp_es/elasticsearch-8.3.3#切换到"elasticsearch"用户，并尝试再次运行Elasticsearch：
su elasticsearch1./bin/elasticsearch

#常驻待确定查看es启动了几个ps aux | grep elasticsearchps -ef | grep elasticsearch
#Elasticsearch在启动过程中遇到了问题。具体来说，它无法获取节点锁，可能是由于数据路径不可写或者多个节点试图使用同一个数据路径。#尝试清理数据路径/algorithm/temp_es/elasticsearch-8.3.3/data，删除其中的节点锁和其他临时文件rm -rf /algorithm/temp_es/elasticsearch-8.3.3/data/*

复制代码

1.4 构建 ANN 索引库

# 以DuReader-Robust 数据集为例建立 ANN 索引库python utils/offline_ann.py --index_name dureader_robust_neural_search --doc_dir data/dureader_dev --query_embedding_model rocketqa-zh-nano-query-encoder --passage_embedding_model rocketqa-zh-nano-para-encoder --embedding_dim 312 --delete_index
#查看数据，打印几条数据curl http://localhost:9200/dureader_robust_neural_search/_search
#删除索引也可以使用下面的命令：curl -XDELETE http://localhost:9200/dureader_robust_query_encoder

复制代码

1.5 启动 RestAPI 模型服务

#指定语义检索系统的Yaml配置文件export PIPELINE_YAML_PATH=rest_api/pipeline/semantic_search_custom.yaml#使用端口号 8891 启动模型服务python rest_api/application.py 8891

复制代码

nltk_data 加载，如果感觉很慢卡住了，可以见问题 C.20

Linux 用户推荐采用 Shell 脚本来启动服务：sh examples/semantic-search/run_neural_search_server.sh
启动后可以使用 curl 命令验证是否成功运行：

    curl -X POST -k http://localhost:8891/query -H 'Content-Type: application/json' -d '{"query": "衡量酒水的价格的因素有哪些?","params": {"Retriever": {"top_k": 5}, "Ranker":{"top_k": 5}}}'

复制代码

1.6 启动 web 页面

pip install streamlit==1.11.1pip install altair==4.2.2 -i https://mirrors.cloud.tencent.com/pypi/simple#配置模型服务地址export API_ENDPOINT=http://127.0.0.1:8891
#在指定端口 8502 启动 WebUIpython -m streamlit run ui/webapp_semantic_search.py --server.port 8502#需要运维开阿里云网管以及端口授权

复制代码

Linux 用户推荐采用 Shell 脚本来启动服务：

    sh examples/semantic-search/run_search_web.sh

复制代码

到这里就可以打开浏览器访问 http://127.0.0.1:8502

关闭进程：

control+c
lsof -i:8502kill -9 PID

复制代码

B.2 CPU 版本

2.1 安装依赖库

安装同 GPU 选择 paddle-2.5.1 版本，提示：Centos 系统下坑比较多需要使用 paddle 2.4.2；Ubuntu 推荐使用 2.5.1 or develop。

conda activate paddlenlpcpu_2.6.0cd /algorithm/temp_es/PaddleNLP-develop/pipelinescd /algorithm/temp_es/elasticsearch-8.3.3
python -m pip install paddlepaddle==2.5.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install --upgrade paddlenlp -i https://mirrors.cloud.tencent.com/pypi/simple#paddle2.4.2 对应NLP 2.5.2版本
pip install --upgrade paddle-pipelines -i https://pypi.tuna.tsinghua.edu.cn/simple

复制代码

demo 测试

python examples/semantic-search/semantic_search_example.py  --device cpu --embedding_dim 256
python examples/semantic-search/semantic_search_example.py --device cpu --query_embedding_model rocketqa-zh-nano-query-encoder --passage_embedding_model rocketqa-zh-nano-para-encoder --params_path checkpoints/model_40/model_state.pdparams --embedding_dim 312

复制代码

2.3 执行 ES

创建新用户使用：创建一个新的用户，例如"esuser":

sudo useradd esuser#将Elasticsearch的安装目录的所有权更改为"esuser":
sudo chown -R esuser:esuser /algorithm/temp_es/elasticsearch-8.3.3#切换到"esuser"用户，并尝试再次运行Elasticsearch：
su esuser  ./bin/elasticsearch

复制代码

2.4 构建索引

python utils/offline_ann.py --index_name dureader_robust_neural_search --doc_dir data/dureader_dev --embedding_dim 256 --device cpu --delete_index
#查看数据，打印几条数据curl http://localhost:9200/dureader_robust_neural_search/_search
#删除索引也可以使用下面的命令：curl -XDELETE http://localhost:9200/dureader_robust_query_encoder

复制代码

lsof -i:8502

kill -9 PID

python -m streamlit run ui/webapp_semantic_search.py --server.port 8502 --server.address 127.0.0.1

C.安装过程遇到相关问题解决---相关项目链接：

目前共记录 21 个在 Windows 和 LInux 下遇到的相关问题

点击链接进行跳转：

释放搜索潜力：基于ES(ElasticSearch)打造高效的语义搜索系统，让信息尽在掌握[1.安装部署篇---完整版]，支持Linux/Windows部署安装

释放搜索潜力：基于ES(ElasticSearch)打造高效的语义搜索系统，让信息尽在掌握[2.项目讲解篇],支持Linux/Windows部署安装

更多优质内容请关注公号：汀丶人工智能；会提供一些相关的资源和优质文章，免费获取阅读。

发布于: 2023-10-27阅读数: 2

原文链接:【http://xie.infoq.cn/article/40598402d3bbd387dc9541563】。

汀丶人工智能

关注

本博客将不定期更新关于NLP等领域相关知识 2022-01-06 加入

本博客将不定期更新关于机器学习、强化学习、数据挖掘以及NLP等领域相关知识，以及分享自己学习到的知识技能，感谢大家关注！

发布

暂无评论

创作场景

释放搜索潜力：基于 ES(ElasticSearch) 打造高效的语义搜索系统，让信息尽在掌握

释放搜索潜力：基于 ES(ElasticSearch)打造高效的语义搜索系统，让信息尽在掌握[1.安装部署篇--简洁版]，支持 Linux/Windows 部署安装

A1.Windows 下搭建语义检索系统

A1.1 运行环境安装

A1.2 paddlenlp 安装（包含了 paddlenlp）

A1.3 下载 pipelines 源代码：github 下载 or 手动下载

A1.4 运行案例查看效果

A2.ES 相关配置

A2.1 版本安装 ES 版本提前官网下载好即可，放在对应路径，进入虚拟环境

A2.2 可视化工具 Kibana

A2.3 ES 修改：config

A2.4 文档数据写入 ann 索引库(重点)

A3.启动 Rest API 模型服务

A4.启动 WebUI

A5. 数据更新

B.linux 下搭建语义检索系统

B.1 GPU 版本

1.1 安装依赖

1.2 测试效果

1.3 执行 ES

1.4 构建 ANN 索引库

1.5 启动 RestAPI 模型服务

1.6 启动 web 页面

B.2 CPU 版本

2.1 安装依赖库

2.3 执行 ES

2.4 构建索引

C.安装过程遇到相关问题解决---相关项目链接：

汀丶人工智能

评论