喂饭级教程 —— 基于 OceanBase seekdb 构建 RAG 应用本文又是一篇喂饭级教程,为大家展示通过 OceanBase seekdb 构建 RAG(检索增强生成)系统的详细步骤。
RAG 系统结合了检索系统和生成模型,可根据给定提示生成新文本。系统首先使用 seekdb 的原生向量搜索功能从语料库中检索相关文档,然后使用生成模型根据检索到的文档生成新文本。
前提条件
已安装 Python 3.11 或以上版本
已安装 uv
已准备好 LLM API Key
准备工作
克隆代码
git clone https://github.com/oceanbase/pyseekdb.git
cd pyseekdb/demo/rag
复制代码
设置环境
安装依赖
基础安装(适用于 default 或 api embedding 类型):
uv sync
本地模型(适用于 local embedding 类型):
uv sync --extra local
提示:
local额外依赖包含 sentence-transformers 及相关依赖(约 2-3 GB)。
如果您在中国大陆,可以使用国内镜像源加速下载:
基础安装(清华源):uv sync --index-url https://pypi.tuna.tsinghua.edu.cn/simple
基础安装(阿里源):uv sync --index-url https://mirrors.aliyun.com/pypi/simple
本地模型(清华源):uv sync --extra local --index-url https://pypi.tuna.tsinghua.edu.cn/simple
本地模型(阿里源):uv sync --extra local --index-url https://mirrors.aliyun.com/pypi/simple
设置环境变量
步骤一:复制环境变量模板
cp .env.example .env
步骤二:编辑 .env 文件,设置环境变量
本系统支持三种 Embedding 函数类型,您可以根据需求选择:
default(默认,推荐新手使用)
local(本地模型)
api(API 服务)
以下使用通义千问作为示例(使用 api 类型):
# Embedding Function 类型:api, local, defaultEMBEDDING_FUNCTION_TYPE=api
# LLM 配置(用于生成答案)OPENAI_API_KEY=sk-your-dashscope-keyOPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1OPENAI_MODEL_NAME=qwen-plus
# Embedding API 配置(仅在 EMBEDDING_FUNCTION_TYPE=api 时需要)EMBEDDING_API_KEY=sk-your-dashscope-keyEMBEDDING_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1EMBEDDING_MODEL_NAME=text-embedding-v4
# 本地模型配置(仅在 EMBEDDING_FUNCTION_TYPE=local 时需要)SENTENCE_TRANSFORMERS_MODEL_NAME=all-mpnet-base-v2SENTENCE_TRANSFORMERS_DEVICE=cpu
# seekdb 配置SEEKDB_DIR=./data/seekdb_ragSEEKDB_NAME=testCOLLECTION_NAME=embeddings
复制代码
环境变量说明:
提示:
如果使用 default 类型,只需配置 EMBEDDING_FUNCTION_TYPE=default 和 LLM 相关变量即可。
如果使用 api 类型,需要额外配置 Embedding API 相关变量。
如果使用 local 类型,需要安装 sentence-transformers 库,并可选择配置模型名称。
主要使用的模块
初始化 LLM 客户端
我们通过加载环境变量来初始化 LLM 客户端:
def get_llm_client() -> OpenAI: """Initialize LLM client using OpenAI-compatible API.""" return OpenAI( api_key=os.getenv("OPENAI_API_KEY"), base_url=os.getenv("OPENAI_BASE_URL"), )
复制代码
创建数据库连接
def get_seekdb_client(db_dir: str = "./seekdb_rag", db_name: str = "test"): """Initialize seekdb client (embedded mode).""" cache_key = (db_dir, db_name) if cache_key not in _client_cache: print(f"Connecting to seekdb: path={db_dir}, database={db_name}") _client_cache[cache_key] = Client(path=db_dir, database=db_name) print("seekdb client connected successfully") return _client_cache[cache_key]
复制代码
自定义嵌入模型的工厂模式
在 .env 文件中可以通过配置 EMBEDDING_FUNCTION_TYPE 使用不同的 embedding_function。您也可以参考这个例子自定义您的 embedding_function。
from pyseekdb import EmbeddingFunction, DefaultEmbeddingFunctionfrom typing import List, Unionimport osfrom openai import OpenAI
Documents = Union[str, List[str]]Embeddings = List[List[float]]
class SentenceTransformerCustomEmbeddingFunction(EmbeddingFunction[Documents]): """ A custom embedding function using sentence-transformers with a specific model. """ def __init__(self, model_name: str = "all-mpnet-base-v2", device: str = "cpu"):# TODO: your own model name and device """ Initialize the sentence-transformer embedding function. Args: model_name: Name of the sentence-transformers model to use device: Device to run the model on ('cpu' or 'cuda') """ self.model_name = model_name or os.environ.get('SENTENCE_TRANSFORMERS_MODEL_NAME') self.device = device or os.environ.get('SENTENCE_TRANSFORMERS_DEVICE') self._model = None self._dimension = None def _ensure_model_loaded(self): """Lazy load the embedding model""" if self._model isNone: try: from sentence_transformers import SentenceTransformer self._model = SentenceTransformer(self.model_name, device=self.device) # Get dimension from model test_embedding = self._model.encode(["test"], convert_to_numpy=True) self._dimension = len(test_embedding[0]) except ImportError: raise ImportError( "sentence-transformers is not installed. " "Please install it with: pip install sentence-transformers" ) @property def dimension(self) -> int: """Get the dimension of embeddings produced by this function""" self._ensure_model_loaded() return self._dimension def __call__(self, input: Documents) -> Embeddings: """ Generate embeddings for the given documents. Args: input: Single document (str) or list of documents (List[str]) Returns: List of embedding vectors """ self._ensure_model_loaded() # Handle single string input if isinstance(input, str): input = [input] # Handle empty input ifnot input: return [] # Generate embeddings embeddings = self._model.encode( input, convert_to_numpy=True, show_progress_bar=False ) # Convert numpy arrays to lists return [embedding.tolist() for embedding in embeddings]
class OpenAIEmbeddingFunction(EmbeddingFunction[Documents]): """ A custom embedding function using Embedding API. """ def __init__(self, model_name: str = "", api_key: str = "", base_url: str = ""): """ Initialize the Embedding API embedding function. Args: model_name: Name of the Embedding API embedding model api_key: Embedding API key (if not provided, uses EMBEDDING_API_KEY env var) """ self.model_name = model_name or os.environ.get('EMBEDDING_MODEL_NAME') self.api_key = api_key or os.environ.get('EMBEDDING_API_KEY') self.base_url = base_url or os.environ.get('EMBEDDING_BASE_URL') self._dimension = None ifnot self.api_key: raise ValueError("Embedding API key is required")
def _ensure_model_loaded(self): """Lazy load the Embedding API model""" try: client = OpenAI( api_key=self.api_key, base_url=self.base_url ) response = client.embeddings.create( model=self.model_name, input=["test"] ) self._dimension = len(response.data[0].embedding) except Exception as e: raise ValueError(f"Failed to load Embedding API model: {e}")
@property def dimension(self) -> int: """Get the dimension of embeddings produced by this function""" self._ensure_model_loaded() return self._dimension def __call__(self, input: Documents) -> Embeddings: """ Generate embeddings using Embedding API. Args: input: Single document (str) or list of documents (List[str]) Returns: List of embedding vectors """ # Handle single string input if isinstance(input, str): input = [input] # Handle empty input ifnot input: return [] # Call Embedding API client = OpenAI( api_key=self.api_key, base_url=self.base_url ) response = client.embeddings.create( model=self.model_name, input=input ) # Extract Embedding API embeddings embeddings = [item.embedding for item in response.data] return embeddings
def create_embedding_function() -> EmbeddingFunction: embedding_function_type = os.environ.get('EMBEDDING_FUNCTION_TYPE') if embedding_function_type == "api": print("Using OpenAI Embedding API embedding function") return OpenAIEmbeddingFunction() elif embedding_function_type == "local": print("Using SentenceTransformer embedding function") return SentenceTransformerCustomEmbeddingFunction() elif embedding_function_type == "default": print("Using Default embedding function") return DefaultEmbeddingFunction() else: raise ValueError(f"Unsupported embedding function type: {embedding_function_type}")
复制代码
创建 Collection
在 get_or_create_collection() 方法中我们传入了 embedding_function,之后使用这个 collection 的 add() 和 query() 方法的时候就不需要传入向量了,只需传入文本,向量会由 embedding_function 自动生成。
def get_seekdb_collection(client, collection_name: str = "embeddings", embedding_function: Optional[EmbeddingFunction] = DefaultEmbeddingFunction(), drop_if_exists: bool = True): """ Get or create a collection using pyseekdb's get_or_create_collection. Args: client: seekdb client instance collection_name: Name of the collection embedding_function: Embedding function (required for automatic embedding generation) drop_if_exists: Whether to drop existing collection if it exists Returns: Collection object """ if drop_if_exists and client.has_collection(collection_name): print(f"Collection '{collection_name}' already exists, deleting old data...") client.delete_collection(collection_name) if embedding_function isNone: raise ValueError("embedding_function is required") # Use pyseekdb's native get_or_create_collection collection = client.get_or_create_collection( name=collection_name, embedding_function=embedding_function ) print(f"Collection '{collection_name}' ready!") return collection
复制代码
核心插入数据函数
def insert_embeddings(collection, data: List[Dict[str, Any]]): """ Insert data into collection. Embeddings are automatically generated by collection's embedding_function.
Args: collection: Collection object (must have embedding_function configured) data: List of data dictionaries containing 'text', 'source_file', 'chunk_index' """ try: ids = [f"{item['source_file']}_{item.get('chunk_index', 0)}"for item in data] documents = [item['text'] for item in data] metadatas = [{'source_file': item['source_file'], 'chunk_index': item.get('chunk_index', 0)} for item in data]
# Collection's embedding_function will automatically generate embeddings from documents collection.add( ids=ids, documents=documents, metadatas=metadatas )
print(f"Inserted {len(data)} items successfully") except Exception as e: print(f"Error inserting data: {e}") raise
复制代码
向量相似度搜索
results = collection.query( query_texts=[question], n_results=3, include=["documents", "metadatas", "distances"] )
复制代码
统计 Collection 中的数据情况
def get_database_stats(collection) -> Dict[str, Any]: """Get statistics about the collection.""" try: results = collection.get(limit=10000, include=["metadatas"]) ids = results.get('ids', []) if isinstance(results, dict) else [] metadatas = results.get('metadatas', []) if isinstance(results, dict) else [] unique_files = {m.get('source_file') for m in metadatas if m and m.get('source_file')} return { "total_embeddings": len(ids), "unique_source_files": len(unique_files) } except Exception as e: print(f"Error getting database stats: {e}") return {"total_embeddings": 0, "unique_source_files": 0}
复制代码
构建 RAG 系统
本模块实现了 RAG 系统的检索功能。通过将用户提出的问题转换为嵌入向量,利用 seekdb 提供的原生向量搜索能力,快速检索出与问题最相关的文档片段,为后续的生成模型提供必要的上下文信息。
导入数据
我们使用 pyseekdb 的 SDK 文档作为示例,您也可以使用自己的 Markdown 文档或者目录。
运行数据导入脚本:
# 导入单个文档uv run python seekdb_insert.py ../../README.md
# 或导入目录下的所有 Markdown 文档uv run python seekdb_insert.py path/to/your_dir
复制代码
启动应用
在 <font style="background-color:rgba(0, 0, 0, 0.06);">pyseekdb/demo/rag 路径下执行如下命令,通过 Streamlit 启动应用:
uv run streamlit run seekdb_app.py --server.port your_port
复制代码
使用 IP 和端口号(默认为 8501,可通过 --server.port 选项自定义)即可在浏览器中打开 RAG 界面。
提示: 如果使用 uv 作为包管理器,请在命令前加上 uv run 前缀,以确保使用正确的 Python 环境和依赖。
实际运行示例
以下是一个真实的查询示例,展示了 seekdb RAG 系统的实际效果:
用户问题:“What is seekdb's vector search?”
检索到的相关文档:
1. 30.experience-vector-retrieval.md Vector search overview In today's era of information explosion, users often need to quickly retrieve the information they need from massive amounts o...
2. 20.using-seekdb-in-python-mode.md Experience vector search SeekDB supports up to 16,000 dimensions of float-type dense vectors, sparse vectors, and various types of vector distance ca...
3. 70.experience-hybrid-vector-index.md Summary Through this tutorial, you have mastered the core features of SeekDB's hybrid vector index: Simplified usage process: Achieve semantic ret...
复制代码
使用 LLM 生成的答案:
SeekDB's vector search is a powerful feature that enables efficient and accurate retrieval of data by encoding various types of data—such as text, images, and audio—into mathematical vectors and performing searches in a vector space. This approach captures deep semantic relationships within the data, going beyond traditional keyword-based search methods to deliver more relevant results.
Key capabilities of SeekDB's vector search include:
High-dimensional vector support: It supports up to 16,000 dimensions for float-type dense vectors and sparse vectors.
Multiple distance metrics: It supports various similarity or distance calculations, including Manhattan distance, Euclidean distance, inner product, and cosine distance.
Advanced indexing methods: Vector indexes can be created using HNSW (Hierarchical Navigable Small World) or IVF (Iterative Virtual File), enabling fast approximate nearest neighbor searches.
Incremental updates: The system allows for real-time insertion and deletion of vectors without compromising search accuracy or recall.
Hybrid search with scalar filtering: Users can combine vector similarity search with traditional attribute-based (scalar) filtering for more precise results.
Flexible access interfaces: SeekDB supports SQL access via MySQL protocol clients in multiple programming languages, as well as a Python SDK.
Automatic embedding and hybrid indexing: With hybrid vector index features, users can store raw text directly—the system automatically converts it into vectors and builds indexes.
In summary, SeekDB's vector search provides a comprehensive, high-performance solution for semantic search, particularly valuable in AI applications involving large-scale unstructured data.
这个示例展示了:
✅ 准确的信息检索:系统成功从文档中找到了相关信息
✅ 多文档整合:从 3 个不同文档中提取和整合信息
✅ 语义匹配:准确匹配了“vector search”相关的文档
✅ 结构化回答:AI 将检索到的信息整理成清晰的结构
✅ 完整性:涵盖了 seekdb 向量搜索的主要特性
✅ 专业性:回答包含了技术细节和实际应用价值
检索质量分析:
最相关文档 : experience-vector-retrieval.md - 向量搜索概览
技术细节 : using-seekdb-in-python-mode.md - 具体的技术规格
高级特性 : experience-hybrid-vector-index.md - 混合向量索引功能
快速体验
如需快速体验 seekdb RAG 系统,请参考 快速部署[3]。
(点击阅读原文,进入 seekdb 项目页面)。
参考资料
[1]
https://dashscope.aliyuncs.com/compatible-mode/v1: https://dashscope.aliyuncs.com/compatible-mode/v1
[2]
https://dashscope.aliyuncs.com/compatible-mode/v1: https://dashscope.aliyuncs.com/compatible-mode/v1
[3]
快速部署: https://github.com/oceanbase/pyseekdb/blob/main/demo/rag/README_CN.md
[4]
seekdb 项目地址:https://github.com/oceanbase/seekdb
评论