Developers often ask me: "Why would I use a vector database like Milvus when I could just use NumPy or FAISS?" My answer: "Many reasons: scalability, ease of management, data persistence. And, most important of all, scalability." When you work with millions or billions of vectors, you need a solution purpose-built for data at that scale.
This article walks through using Milvus on a sample dataset of more than 40 million vectors. It also shows how features such as metadata filtering, an essential capability that plain vector search libraries often lack, can significantly improve your search results.
Milvus: A Highly Scalable Vector Database
Milvus is a popular open-source vector database that powers AI applications with high-performance, scalable vector similarity search.
Here are some key Milvus features related to scalability:
Distributed architecture: Milvus uses a distributed architecture that scales horizontally without friction, spreading data and workloads across multiple nodes. This keeps Milvus highly available and resilient even under heavy load.
Optimized indexing: Milvus supports multiple index types, such as IVF_FLAT and HNSW, as well as GPU indexes for faster search. If you need to lower the cost of running your vector database, you can also use DiskANN: search is slightly slower, but the cost drops significantly.
Billion-scale vector search: A defining Milvus capability is efficient storage and search over billions of vectors, which is essential for applications that need fast similarity search at very large scale.
Flexible compute options: Milvus can harness both GPUs and CPUs, intelligently assigning each task to the hardware best suited for it. This enables parallel processing and significantly speeds up compute-intensive operations.
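To make the indexing trade-off above concrete, here is a hedged sketch of what the two configurations might look like as plain index-parameter dictionaries. The index type names follow Milvus's documented types; the tuning values are illustrative, not recommendations.

```python
# Illustrative index configurations for a vector field (values are examples, not tuned).
# HNSW: an in-memory graph index; fastest queries, highest memory footprint.
hnsw_index = {
    "index_type": "HNSW",
    "metric_type": "COSINE",
    "params": {"M": 8, "efConstruction": 64},  # graph degree / build-time search width
}

# DISKANN: keeps most of the index on disk (NVMe); slightly slower searches,
# but significantly cheaper to run at large scale.
diskann_index = {
    "index_type": "DISKANN",
    "metric_type": "COSINE",
}

print(hnsw_index["index_type"], "vs", diskann_index["index_type"])
```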
Dataset Overview: Tens of Millions of Wikipedia Article Embeddings
The dataset we use contains Wikipedia articles converted into embedding vectors with a Cohere embedding model. It is freely available on HuggingFace.
Each example contains a full Wikipedia article, cleaned to remove Markdown formatting and unwanted sections such as references.
The dataset covers more than 300 languages, but for our use case we will focus on the English articles: 41.5 million vectors. Notably, these embeddings are multilingual. That means we can search across languages and rely on the model to match similar meanings, even when the texts are written in different languages!
Getting Started with Milvus
This section covers how to get up and running with Milvus: installing the Milvus SDK, setting up Zilliz Cloud, connecting to Milvus, and creating a collection.
Installing the Milvus SDK
First, install the Milvus SDK by running pip install pymilvus. To keep things simple, we will deploy our Milvus instance on Zilliz Cloud (fully managed Milvus). Depending on your data size, you can also deploy Milvus yourself on Kubernetes.
Hands-On Guide
from pymilvus import MilvusClient

# Connect to the Milvus instance hosted on Zilliz Cloud
client = MilvusClient(uri=<ZILLIZ_CLOUD_URI>, token=<ZILLIZ_TOKEN>)
# Define the schema for the collection
from pymilvus import DataType

embedding_dim = 1024
schema = MilvusClient.create_schema(auto_id=True)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="text_vector", datatype=DataType.FLOAT_VECTOR, dim=embedding_dim)
schema.add_field(field_name="title", datatype=DataType.VARCHAR, max_length=5000)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=5000)
# Create the collection if it doesn't exist
collection_name = "cohere_embeddings"
if not client.has_collection(collection_name):
    client.create_collection(collection_name=collection_name, schema=schema)
# Define and create an HNSW index on the vector field
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="text_vector",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 8, "efConstruction": 64},
)
client.create_index(collection_name=collection_name, index_params=index_params)

# Load the collection into memory so it can serve searches
client.load_collection(collection_name=collection_name)
If you choose to stream the dataset, you can iterate over it directly instead of downloading the whole thing first. This is useful when you don't have enough disk space for the full dataset, or when you want to start working with the data before the download finishes.
from datasets import load_dataset
from tqdm import tqdm

# Insert a batch of rows into the collection, then clear the buffer
def insert_batch(client, collection_name, batch_data):
    client.insert(collection_name=collection_name, data=batch_data)
    batch_data.clear()

# Stream the data from HuggingFace and insert it in batches
def insert_data(client, collection_name):
    batch_size = 1000  # Adjust the batch size as needed
    batch_data = []
    docs = load_dataset(
        "Cohere/wikipedia-2023-11-embed-multilingual-v3",
        "en",  # English only
        split="train",
        streaming=True,  # Allows us to iterate over the dataset
    )
    for doc in tqdm(docs, desc="Streaming and preparing data for Milvus"):
        title = doc["title"][:4500]  # Titles can be very long
        text = doc["text"][:4500]    # Text can be very long
        emb = doc["emb"]             # The precomputed embedding vector
        batch_data.append({"title": title, "text_vector": emb, "text": text})
        if len(batch_data) >= batch_size:
            insert_batch(client, collection_name, batch_data)
    if batch_data:
        insert_batch(client, collection_name, batch_data)
Scalar Filtering
When you are dealing with millions or billions of vectors, filtering stops being a nice-to-have: it is essential.
Here is what makes filtering in Milvus effective:
Bitsets: Milvus uses bitsets to mark which vectors satisfy the filter conditions. Thanks to low-level CPU operations, bitset manipulation is extremely fast: it is like having a machine that can sift through billions of entries in an instant.
Reduced search space: Unlike some alternatives (pgvector, for example), Milvus applies metadata filters before running the vector similarity search, drastically cutting the number of vectors to process. Milvus can narrow a search from billions of vectors down to the few thousand you actually need, in milliseconds.
Scalar indexes (optional): For fields you filter on frequently, Milvus lets you create scalar indexes. Milvus does not create them automatically, but once you set one up manually it can dramatically accelerate filtering.
The beauty of Milvus is that these features are not just available; they work at scale, whether you are handling 40 million vectors or 40 billion.
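To make the filter-before-search idea concrete, here is a deliberately tiny pure-Python sketch of the concept (an illustration of the principle, not Milvus's actual internals): first evaluate the metadata filter into a bitset, then compute similarities only for the rows the bitset keeps. The toy data and the "lang" field are made up for the example.

```python
import numpy as np

# Toy data: 6 vectors with a hypothetical "lang" metadata field.
rng = np.random.default_rng(42)
vectors = rng.normal(size=(6, 4))
langs = np.array(["en", "fr", "en", "de", "en", "fr"])

# Step 1: the bitset marks which rows satisfy the scalar filter.
bitset = langs == "en"

# Step 2: the similarity search only touches the surviving rows, so the
# search space shrinks before any distance computation happens.
query = rng.normal(size=4)
candidates = vectors[bitset]
sims = candidates @ query / (
    np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
)
best_global = int(np.flatnonzero(bitset)[np.argmax(sims)])
print("best match row:", best_global, "lang:", langs[best_global])
```

At Milvus's scale the same two-step shape applies, except the bitset operations are vectorized at the CPU-instruction level and the candidate scoring goes through the vector index rather than a brute-force dot product.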
Exact-Match Filtered Search
Find documents with a specific title.
from pprint import pprint

FILTER_TITLE = "British Arab Commercial Bank"
res = client.query(
    collection_name=collection_name,
    filter=f'title like "{FILTER_TITLE}"',
    output_fields=["title", "text"],
)
for elt in res:
    pprint(elt)
The returned documents all have the title British Arab Commercial Bank.
{'id': 450933285225527270,
'text': 'The British Arab Commercial Bank PLC (BACB) is an international '
'wholesale bank incorporated in the United Kingdom that is authorised '
'by the Prudential Regulation Authority (PRA) and regulated by the '
'PRA and the Financial Conduct Authority (FCA). It was founded in '
'1972 as UBAF Limited, adopted its current name in 1996, and '
'registered as a public limited company in 2009. The bank has clients '
'trading in and out of developing markets in the Middle East and '
'Africa.',
'title': 'British Arab Commercial Bank'}
{'id': 450933285225527271,
'text': 'BACB has a head office in London, and three representative offices '
'in Algiers in Algeria, Tripoli in Libya and Abidjan in the Cote '
"D'Ivoire. The bank has 17 sister banks across Europe, Asia and "
'Africa. It is owned by three main shareholders - the Libyan Foreign '
'Bank (87.80%), Banque Centrale Populaire (6.10%) and Banque '
"Extérieure d'Algérie (6.10%).",
'title': 'British Arab Commercial Bank'}
Filtered Search with Prefix/Infix/Suffix Patterns
Search for documents containing a specific term.
res = client.query(
    collection_name=collection_name,
    filter='text like "%Calectasia%"',  # Infix: "Calectasia" anywhere in the text
    # filter='text like "Calect%"',     # Prefix: text starting with "Calect"
    # filter='text like "%lectasia"',   # Suffix: text ending with "lectasia"
    output_fields=["title", "text"],
    limit=5,
)
for elt in res:
    pprint(elt)
The returned documents contain the word "Calectasia".
{'id': 450933285225527360,
'text': 'Calectasia is a genus of about fifteen species of flowering plants '
'in the family Dasypogonaceae and is endemic to south-western '
'Australia. Plants is this genus are small, erect shrubs with '
'branched stems covered by leaf sheaths. The flowers are star-shaped, '
'lilac-blue to purple and arranged singly on the ends of short '
'branchlets.',
'title': 'Calectasia'}
{'id': 450933285225527361,
'text': 'Plants in the genus Calectasia are small, often rhizome-forming '
'shrubs with erect, branched stems with sessile leaves arranged '
'alternately along the stems, long and about wide, the base held '
'closely against the stem and the tip pointed. The flowers are '
'arranged singly on the ends of branchlets and are bisexual, the '
'three sepals and three petals are similar to each other, and joined '
'at the base forming a short tube but spreading, forming a star-like '
'pattern with a metallic sheen. Six bright yellow or orange stamens '
'form a tube in the centre of the flower with a thin style extending '
'beyond the centre of the tube.',
'title': 'Calectasia'}
Filtered Search with "Not In"
Exclude articles with specific titles from the search.
res = client.query(
    collection_name=collection_name,
    filter='title not in ["British Arab Commercial Bank", "Calectasia"]',
    output_fields=["title", "text"],
    limit=10,
)
for elt in res:
    pprint(elt)
The returned documents have titles that are neither "British Arab Commercial Bank" nor "Calectasia".
{'id': 450933285225527281,
'text': 'The Commonwealth Skyranger, first produced as the Rearwin Skyranger, '
'was the last design of Rearwin Aircraft before the company was '
'purchased by a new owner and renamed Commonwealth Aircraft. It was '
'a side-by-side, two-seat, high-wing taildragger.',
'title': 'Commonwealth Skyranger'}
{'id': 450933285225527282,
'text': 'The Rearwin company had specialized in aircraft powered by small '
'radial engines, such as their Sportster and Cloudster, and had even '
'purchased the assets of LeBlond Engines to make small radial engines '
'in-house in 1937. By 1940, however, it was clear Rearwin would need '
'a design powered by a small horizontally opposed engine to remain '
'competitive. Intended for sport pilots and flying businessmen, the '
'"Rearwin Model 165" first flew on April 9, 1940. Originally named '
'the "Ranger," Ranger Engines (who also sold several engines named '
'"Ranger") protested, and Rearwin renamed the design "Skyranger." The '
'overall design and construction methods allowed Rearwin to take '
'orders for Skyrangers then deliver the aircraft within 10 weeks.',
'title': 'Commonwealth Skyranger'}
Cohere 🤝 Milvus: Effortless, Efficient Vector Similarity Search
Let's run a similarity search using Cohere embeddings. We will convert the question Who founded Wikipedia into an embedding vector and use it to search our Milvus database. This is the same model that was used to encode the Wikipedia articles.
PyMilvus ships with an integration between Milvus and Cohere; we will use PyMilvus's model subpackage, installed with pip install "pymilvus[model]".
from pymilvus.model.dense import CohereEmbeddingFunction

cohere_ef = CohereEmbeddingFunction(
    model_name="embed-multilingual-v3.0",
    input_type="search_query",
    embedding_types=["float"],
)
query = "Who founded Wikipedia"
embedded_query = cohere_ef.encode_queries([query])
response = embedded_query[0]
print(response[:10])
The following call returns the Wikipedia passages most relevant to Wikipedia's founders.
res = client.search(
    collection_name=collection_name,
    data=[response],  # search() expects a list of query vectors
    output_fields=["text"],
    limit=3,
)
for elt in res:
    pprint(elt)
The results:
[{'distance': 0.7344469428062439,
'entity': {'text': 'Larry Sanger and Jimmy Wales are the ones who started '
'Wikipedia. Wales is credited with defining the goals of '
'the project. Sanger created the strategy of using a wiki '
"to reach Wales' goal. On January 10, 2001, Larry Sanger "
'proposed on the Nupedia mailing list to create a wiki as '
'a "feeder" project for Nupedia. Wikipedia was launched '
'on January 15, 2001. It was launched as an '
'English-language edition at www.wikipedia.com, and '
'announced by Sanger on the Nupedia mailing list. '
'Wikipedia\'s policy of "neutral point-of-view" was '
'enforced in its initial months and was similar to '
'Nupedia\'s earlier "nonbiased" policy. Otherwise, there '
"weren't very many rules initially, and Wikipedia "
'operated independently of Nupedia.'},
'id': 450933285241797095},
{'distance': 0.7239157557487488,
'entity': {'text': 'Wikipedia was originally conceived as a complement to '
'Nupedia, a free on-line encyclopedia founded by Jimmy '
'Wales, with articles written by highly qualified '
'contributors and evaluated by an elaborate peer review '
'process. The writing of content for Nupedia proved to be '
'extremely slow, with only 12 articles completed during '
'the first year, despite a mailing-list of interested '
'editors and the presence of a full-time editor-in-chief '
'recruited by Wales, Larry Sanger. Learning of the wiki '
'concept, Wales and Sanger decided to try creating a '
'collaborative website to provide an additional source of '
'rapidly produced draft articles that could be polished '
'for use on Nupedia.'},
'id': 450933285225827849},
{'distance': 0.7191773653030396,
'entity': {'text': "The foundation's creation was officially announced by "
'Wikipedia co-founder Jimmy Wales, who was running '
'Wikipedia within his company Bomis, on June 20, 2003.'},
'id': 450933285242058780}]
Let's check the cluster's performance metrics, focusing on search query latency. We can retrieve the average and P99 latency through the Zilliz Cloud API.
First, the average latency.
> curl --request POST \
--url https://api.cloud.zilliz.com/v2/clusters/<cluster_id>/metrics/query \
[...]
"metricQueries": [
{
"name": "REQ_SEARCH_LATENCY",
"stat": "AVG"
}
}'
[{"name":"REQ_SEARCH_LATENCY","stat":"AVG","unit":"millisecond","values":[{"timestamp":"2024-08-26T11:09:53Z","value":"2.0541596873255163"}]}]
The average latency is 2.05 milliseconds.
Now the P99 latency:
> curl --request POST \
--url https://api.cloud.zilliz.com/v2/clusters/<cluster_id>/metrics/query \
[...]
"metricQueries": [
{
"name": "REQ_SEARCH_LATENCY",
"stat": "P99"
}
}'
[{"name":"REQ_SEARCH_LATENCY","stat":"P99","unit":"millisecond","values":[{"timestamp":"2024-08-26T11:09:53Z","value":"4.949999999999999"}]}]
The P99 latency is 4.95 milliseconds.
These numbers reflect excellent performance: the average query takes about 2 milliseconds, and 99% of queries complete in under 5 milliseconds. The cluster delivers this search efficiency even at this data scale.
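As a reminder of what these two statistics measure, here is a small self-contained example that computes the average and the 99th percentile over a list of latency samples. The numbers are made up for illustration, not taken from our cluster.

```python
import numpy as np

# Hypothetical per-query latencies in milliseconds (illustrative only).
latencies_ms = np.array([1.8, 2.0, 2.1, 1.9, 2.2, 2.0, 4.7, 2.1, 1.9, 2.0])

avg = latencies_ms.mean()              # typical query cost
p99 = np.percentile(latencies_ms, 99)  # 99% of queries finish at or below this
print(f"avg={avg:.2f} ms, p99={p99:.2f} ms")
```

Note how a single slow outlier barely moves the average but dominates the P99, which is why both statistics are worth monitoring.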
Building a Basic RAG System with Milvus
Retrieval Augmented Generation (RAG) is a technique for mitigating the hallucinations that large language models (LLMs) such as ChatGPT are prone to.
In the example below, we use the results retrieved from Milvus to build a simple RAG system, giving the LLM access to the information stored in Milvus.
# Build the context from the earlier search results (res[0] holds the top hits)
context = "\n".join(hit["entity"]["text"] for hit in res[0])
question = "Who created Wikipedia?"
SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""
from openai import OpenAI

# Point the OpenAI client at a local Ollama server; named llm_client to
# avoid shadowing the Milvus client defined earlier.
llm_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client library, but unused by Ollama
)
response = llm_client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)
According to the provided context, Larry Sanger and Jimmy Wales are the ones who started Wikipedia. Specifically, Jimmy Wales defined the goals of the project, while Larry Sanger created the strategy of using a wiki to achieve those goals.
More Milvus Features
Beyond scalability, Milvus offers additional advanced features for working with large datasets:
Partitions: For collections that hold data from multiple users or tenants, Milvus provides partition keys to isolate data logically. This makes metadata filtering more efficient, for example filtering by user ID or organization name.
Range search: Efficiently find vectors within a specified distance band of a query vector, enabling more precise and flexible similarity search in high-dimensional space.
Hybrid search: Search across multiple vector fields in a single query. You can, for example, combine text and image vectors for multimodal search, or pair dense and sparse vectors to combine semantic search with full-text search. This enables versatile, flexible retrieval.
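To see what a range search does conceptually, here is a minimal NumPy sketch that keeps every vector whose cosine similarity to the query falls inside a band. The lo/hi bounds are generic illustrations; Milvus expresses the band through its own radius and range_filter search parameters.

```python
import numpy as np

# Range search in miniature: return all vectors whose similarity to the
# query lies inside a band, e.g. "related, but not a near-duplicate".
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 8))
query = rng.normal(size=8)

sims = vectors @ query / (
    np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
)
lo, hi = 0.2, 0.9  # illustrative similarity bounds
hits = np.flatnonzero((sims >= lo) & (sims <= hi))
print("matches in band:", len(hits))
```

Unlike top-k search, the number of results is not fixed in advance: you get however many vectors fall inside the band.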
For more details, see the Milvus documentation.
Conclusion
Milvus offers a powerful, scalable solution for handling billions of vectors. With features such as filtering and a distributed architecture, it meets demanding performance requirements and is an excellent fit for AI applications.
Our performance measurements showed:
Average query latency: 2.05 ms
P99 latency: 4.95 ms
These results show that Milvus stays fast even with tens of millions of vectors, keeping search latency under 5 milliseconds. If you are looking for a vector database service that can handle large-scale datasets and still return results quickly, Milvus is well worth exploring.
If you enjoyed this post, consider giving us a star on GitHub 🌟, and join the Milvus community to share your experience.