Tongyi × Milvus: Build Your Own AI Technical Advisor, Step by Step

Author: Zilliz

2024-11-12
Shanghai

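1. Introduction

As Milvus developers and users know, vector databases are being applied in more and more scenarios, and the technical depth and complexity of the problems involved keep growing with them. When building AI applications and machine learning projects, we often face the same dilemma: we know the solution exists somewhere, yet we cannot find precise technical guidance. Every Milvus developer would like a technical advisor at hand who can immediately answer key questions about vector search, data indexing, and performance tuning. Imagine a dedicated AI assistant that fully understands your project's context and responds within seconds: it can save you a great deal of troubleshooting time and help you resolve problems quickly. Building your own local technical advisor makes using and developing with Milvus a smoother experience.

2. Building Your Own Milvus Technical Advisor, Step by Step

This section describes in detail how to combine a Milvus dataset with a local Qwen large language model to build an AI technical advisor that can understand and answer domain-specific technical questions. We start from scratch and walk through the whole process: preparing the training dataset, running the fine-tuning, and evaluating the resulting model. With this hands-on walkthrough, even beginners should be able to build their own technical advisor and work more efficiently and creatively on Milvus projects.

Note: this article skips some basic environment setup steps; please look them up on your own if needed.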

2.1 Environment Requirements

2.2 Environment Preparation

Check the GPU status

[root@Qwen-main] nvidia-smi
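
The same check can be scripted when you want to confirm free GPU memory before launching training. The sketch below is only an illustration: it assumes `nvidia-smi` is on the PATH and uses its CSV query mode; the helper names `parse_gpu_csv` and `query_gpus` are made up for this example.

```python
import shutil
import subprocess

# nvidia-smi CSV query: one line per GPU, memory values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'name, total, used' lines into a list of dicts (hypothetical helper)."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split from the right so a comma inside the GPU name cannot break parsing.
        name, total, used = (part.strip() for part in line.rsplit(",", 2))
        gpus.append({"name": name, "total_mib": int(total), "used_mib": int(used)})
    return gpus


def query_gpus():
    """Return parsed GPU info, or an empty list if nvidia-smi is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    for gpu in query_gpus():
        free = gpu["total_mib"] - gpu["used_mib"]
        print(f'{gpu["name"]}: {free} MiB free of {gpu["total_mib"]} MiB')
```

An empty result means no usable GPU was detected, in which case the fine-tuning steps later in this article are unlikely to run as described.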

Create and activate a Python virtual environment

[root@Qwen-main] conda create -n qwen


[root@Qwen-main] conda activate qwen

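Prepare the dataset

https://milvus.io/docs/schema.md

This dataset uses part of the official Milvus documentation as raw data. Save the pages locally; they are cleaned and preprocessed in a later step.

Clone the project locally

Note: the models that the Qwen project can run locally are listed in the repository below. Qwen2 models cannot be run with this project.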

https://github.com/QwenLM/Qwen

(qwen) [root@Qwen-main] git clone https://github.com/QwenLM/Qwen

Download the Qwen-1.8B model

(qwen) [root@Qwen-main] modelscope download --model Qwen/Qwen-1_8B-Chat

Note: the model is stored locally at the following path

(qwen) [root@Qwen-main] ls /root/.cache/modelscope/hub/Qwen/Qwen-1_8B-Chat/

2.3 Fine-Tuning

2.3.1 Install PyTorch

Note: choose the PyTorch build that matches your environment according to the official PyTorch version requirements.

(qwen) [root@Qwen-main] pip3 install torch torchvision torchaudio -i https://mirrors.aliyun.com/pypi/simple/

Install the dependencies required by the Qwen project

(qwen) [root@Qwen-main] pip3 install peft -i https://mirrors.aliyun.com/pypi/simple/


(qwen) [root@Qwen-main] pip3 install -r requirements_web_demo.txt -i https://mirrors.aliyun.com/pypi/simple/

Contents of requirements_web_demo.txt:

gradio<3.42
mdtex2html

(qwen) [root@Qwen-main] pip3 install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/

Contents of requirements.txt:

transformers>=4.32.0,<4.38.0
accelerate
tiktoken
einops
transformers_stream_generator==0.0.4
scipy
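
Because the `transformers` pin is a bounded range, it can be worth confirming what actually got installed. The following is a minimal sketch using only the standard library; the helpers `as_tuple`, `in_range`, and `check_installed` are hypothetical names, and the comparison handles plain dotted versions only rather than the full version-specifier rules.

```python
from importlib import metadata


def as_tuple(version):
    """Turn '4.37.2' into (4, 37, 2); anything after the numeric part is ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def in_range(version, minimum, below):
    """True if minimum <= version < below, comparing numeric components."""
    return as_tuple(minimum) <= as_tuple(version) < as_tuple(below)


def check_installed(package, minimum, below):
    """Return (installed_version, ok) or None if the package is not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return installed, in_range(installed, minimum, below)


if __name__ == "__main__":
    # Range taken from requirements.txt: transformers>=4.32.0,<4.38.0
    print(check_installed("transformers", "4.32.0", "4.38.0"))
```

A result of `None` means the package is missing from the active environment, which usually indicates that the `qwen` environment was not activated before running `pip3`.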

2.3.2 Launch the Original Qwen-1.8B Model

(qwen) [root@Qwen-main] python3 web_demo.py --server-name 0.0.0.0 -c /root/.cache/modelscope/hub/Qwen/Qwen-1_8B-Chat

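Ask the model a test question

Note: before fine-tuning, the Qwen-1.8B model's answer does not meet expectations. We want an answer about Milvus, not about MongoDB.

2.3.3 Dataset Preprocessing

Note: this step uses a script to convert the raw Milvus dataset prepared earlier into JSON, a format the large model can consume.

The conversion is needed because Qwen is a conversational model that has to understand context, and a JSON structure of conversation turns represents that context more clearly.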
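
To make the target format concrete, here is a minimal sketch of one training record together with a small sanity check. The record shape follows the `id` / `conversations` / `from` / `value` keys produced by the conversion script in this section; the example text, the `validate_records` helper, and the assumption that turns strictly alternate between `user` and `assistant` are illustrative only.

```python
import json


def validate_records(records):
    """Return a list of problems; an empty list means the records look usable."""
    problems = []
    for i, rec in enumerate(records):
        if not rec.get("id"):
            problems.append(f"record {i}: missing id")
        turns = rec.get("conversations")
        if not isinstance(turns, list) or not turns:
            problems.append(f"record {i}: conversations must be a non-empty list")
            continue
        for j, turn in enumerate(turns):
            # Assumption: turns alternate, starting with the user.
            expected = "user" if j % 2 == 0 else "assistant"
            if turn.get("from") != expected:
                problems.append(f"record {i}, turn {j}: expected from={expected!r}")
            if not str(turn.get("value", "")).strip():
                problems.append(f"record {i}, turn {j}: empty value")
    return problems


# Placeholder record; real records come from the converted Milvus docs.
example = [
    {
        "id": "milvus_doc_example",
        "conversations": [
            {"from": "user", "value": "What is a collection schema in Milvus?"},
            {"from": "assistant", "value": "Placeholder answer taken from the docs."},
        ],
    }
]

print(json.dumps(example, ensure_ascii=False, indent=2))
print(validate_records(example))
```

Running a check like this on the converted file before training can catch empty answers early, which are otherwise easy to miss in a large JSON file.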

import argparse
import json
import os
from typing import Dict, List


def convert_milvus_doc_to_conversation(doc_data: Dict) -> List[Dict]:
    """Convert a Milvus document into the conversation training format."""
    conversations = []

    # Iterate over all sections
    for section in doc_data.get('sections', []):
        # Use the section title as the base question
        # (the Chinese prompt text is training data and is kept verbatim)
        base_question = f"请详细解释Milvus文档中的 {section.get('title', '未知主题')}"

        # Collect the key information from the content as the answer
        content_texts = []
        for content in section.get('content', []):
            # Handle the different content types
            if content.get('type') == 'paragraph':
                content_texts.append(content.get('text', ''))
            elif content.get('type') == 'code':
                content_texts.append(content.get('text', ''))
            elif content.get('type') == 'list':
                # Handle list-type content
                list_items = content.get('items', [])
                content_texts.extend(list_items)

        # Drop empty content
        content_texts = [text for text in content_texts if text.strip()]

        answer = ' '.join(content_texts)[:1000]  # Limit the length to 1000 characters

        if answer:
            conversation_item = {
                "id": f"milvus_doc_{section.get('title', 'unknown')}",
                "conversations": [
                    {"from": "user", "value": base_question},
                    {"from": "assistant", "value": answer}
                ]
            }
            conversations.append(conversation_item)

    print(f"Converted {len(conversations)} conversations")
    return conversations


def convert_milvus_dataset(input_file: str, output_file: str) -> None:
    """Convert a Milvus documentation dataset."""
    all_conversations = []

    with open(input_file, 'r', encoding='utf-8') as f:
        doc_data = json.load(f)

        # If the file holds a list, process the documents one by one
        if isinstance(doc_data, list):
            for item in doc_data:
                converted_docs = convert_milvus_doc_to_conversation(item)
                all_conversations.extend(converted_docs)
        else:
            converted_docs = convert_milvus_doc_to_conversation(doc_data)
            all_conversations.extend(converted_docs)

    # Make sure the output directory exists (using an absolute path)
    output_dir = os.path.abspath(os.path.dirname(output_file))
    os.makedirs(output_dir, exist_ok=True)

    # Save the converted data
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(all_conversations, f, ensure_ascii=False, indent=2)

    print(f"Conversion finished: {len(all_conversations)} conversations in total")
    print(f"Converted data saved to: {output_file}")


def main():
    parser = argparse.ArgumentParser(description='Convert a Milvus documentation dataset')
    parser.add_argument('--input_file', type=str, required=True, help='Path to the input file')
    parser.add_argument('--output_file', type=str, required=True, help='Path to the output file')

    args = parser.parse_args()

    convert_milvus_dataset(args.input_file, args.output_file)


if __name__ == "__main__":
    main()
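
The article does not show the raw input file, so the sketch below reconstructs a plausible one purely from the keys the script reads (`sections`, `title`, `content`, `type`, `text`, `items`). The sample text and the file name `milvus_docs_raw.json` are assumptions for illustration, not the author's actual data.

```python
import json
import os
import tempfile

# Hypothetical sample; real input would come from the cleaned Milvus docs.
sample_doc = {
    "sections": [
        {
            "title": "Schema",
            "content": [
                {"type": "paragraph", "text": "Example paragraph text about schemas."},
                {"type": "code", "text": "print('example code snippet')"},
                {"type": "list", "items": ["first example item", "second example item"]},
            ],
        }
    ]
}


def write_sample(path):
    """Write the sample document as UTF-8 JSON and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_doc, f, ensure_ascii=False, indent=2)
    return path


sample_path = write_sample(os.path.join(tempfile.mkdtemp(), "milvus_docs_raw.json"))

# Reload the file to confirm it round-trips as valid JSON.
with open(sample_path, encoding="utf-8") as f:
    loaded = json.load(f)

print(sample_path)
```

If the conversion script above were saved under a name such as `convert_dataset.py` (a hypothetical file name), it could then be run with `--input_file` pointing at this sample and `--output_file` pointing at the training JSON. Each section becomes one question-and-answer pair, with the paragraph, code, and list items joined into a single answer capped at 1000 characters.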

2.3.4 Fine-Tune with LoRA

(qwen) [root@Qwen-main]  bash  finetune/finetune_lora_single_gpu.sh

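Key parameters:

# Model and data paths
--model_name_or_path $MODEL        # Path to the base pretrained model; determines the initial weights
--data_path $DATA                  # Path to the training dataset; determines what the model learns

# Epochs and batch sizes
--num_train_epochs 1               # Number of training epochs; affects how thoroughly the model learns the data
--per_device_train_batch_size 16   # Batch size per device; affects gradient computation and GPU memory use
--gradient_accumulation_steps 1    # Gradient accumulation steps; can emulate a larger batch size

# Learning rate and optimization
--learning_rate 2e-3               # Learning rate; controls how fast the weights are updated
--weight_decay 0.005               # Weight decay; helps prevent overfitting
--warmup_ratio 0.01                # Learning-rate warm-up ratio; reduces instability early in training
--lr_scheduler_type "constant"     # Learning-rate scheduling strategy

# Training limits
--model_max_length 128             # Maximum input sequence length
--max_steps 2000                   # Total number of training steps

# Performance and memory optimization
--gradient_checkpointing True      # Gradient checkpointing; reduces GPU memory use
--lazy_preprocess True             # Lazy preprocessing; speeds up data loading

# LoRA-specific settings
--use_lora                         # Enable LoRA fine-tuning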
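
How these settings interact is easier to see with a little arithmetic. The sketch below assumes the script passes the flags to the Hugging Face `Trainer`, where a positive `max_steps` overrides `num_train_epochs`; the `checkpoint-2000` directory used in section 2.3.5 is consistent with that. The dataset size of 500 records is a made-up figure, and `training_budget` is a hypothetical helper.

```python
def training_budget(per_device_batch, grad_accum, num_devices, max_steps, dataset_size):
    """Return (effective batch size, samples processed, passes over the dataset)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    samples_processed = effective_batch * max_steps
    passes = samples_processed / dataset_size
    return effective_batch, samples_processed, passes


# Values from the parameter list above; one GPU is assumed from the script name.
# dataset_size=500 is an invented figure purely for illustration.
effective, processed, passes = training_budget(16, 1, 1, 2000, dataset_size=500)
print(effective, processed, passes)
```

With a small dataset, 2000 steps at batch size 16 means the model sees every record many times, so it is worth watching for memorization rather than generalization when you test the result.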

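Inspect the fine-tuned model

2.3.5 Merge the Model

Note: merging means permanently folding the weights fine-tuned on a domain-specific dataset (here, Milvus) into the original pretrained model, producing a customized model with domain-specific knowledge and capabilities.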
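
Numerically, the merge can be illustrated on a single weight matrix. In the standard LoRA formulation the adapter stores two small matrices A and B, and merging computes W' = W + (alpha / r) · B·A. The toy sketch below uses plain Python lists and invented values; it is not the PEFT implementation, only a demonstration that the merged matrix gives the same output as the base weight plus the adapter path.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora_weight(w, lora_a, lora_b, alpha, r):
    """Return W' = W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy values: a 2x2 base weight and a rank-1 adapter (A is 1x2, B is 2x1).
W = [[1, 0], [0, 1]]
A = [[1, 2]]
B = [[0.5], [1]]
alpha, r = 2, 1

merged = merge_lora_weight(W, A, B, alpha, r)
x = [[1], [1]]  # column vector input

# Output of the merged weight versus base output plus the scaled adapter output.
merged_out = matmul(merged, x)
adapter_out = matmul(B, matmul(A, x))
separate_out = [
    [base[0] + (alpha / r) * extra[0]]
    for base, extra in zip(matmul(W, x), adapter_out)
]
print(merged, merged_out, separate_out)
```

Because the two paths agree, the merged model no longer needs the adapter files at inference time, which is why it can be loaded by the web demo like any ordinary checkpoint.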

[root@Qwen-main]  python3 merge_lora.py


import os

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, GenerationConfig


def merge_lora(
    lora_model_path="/root/qwen-wt/Qwen-main/output_qwen/checkpoint-2000",
    base_model_path="/root/.cache/modelscope/hub/Qwen/Qwen-1_8B-Chat",
    save_path="/root/qwen-wt/Qwen-main/Milvus-model",
):
    print("Starting model merge...")

    # Load the LoRA model (the adapter together with its base model)
    print(f"Loading LoRA model from {lora_model_path}")
    model = AutoPeftModelForCausalLM.from_pretrained(
        lora_model_path,
        device_map="auto",
        torch_dtype=torch.float16,
        trust_remote_code=True,  # Qwen ships custom model code, so this flag is needed
    )

    # Merge the adapter weights into the base model
    print("Merging models...")
    merged_model = model.merge_and_unload()

    # Save the merged model
    print(f"Saving merged model to {save_path}")
    os.makedirs(save_path, exist_ok=True)
    merged_model.save_pretrained(save_path)

    # Save the tokenizer and generation_config
    print("Saving tokenizer and configs...")
    tokenizer = AutoTokenizer.from_pretrained(
        base_model_path,
        trust_remote_code=True
    )
    tokenizer.save_pretrained(save_path)

    # Copy generation_config.json from the base model
    generation_config = GenerationConfig.from_pretrained(
        base_model_path,
        trust_remote_code=True
    )
    generation_config.save_pretrained(save_path)

    print(f"Merge completed! Merged model saved to: {save_path}")
    print("\nYou can now use the merged model with web demo:")
    print(f"python web_demo.py --server-name 0.0.0.0 -c {save_path}")


if __name__ == "__main__":
    merge_lora()

2.3.6 Launch the New Model and Test It

Note: when we ask the model the same question as before, the answer now meets expectations.

[root@Qwen-main] python web_demo.py --server-name 0.0.0.0 -c /root/qwen-wt/Qwen-main/Milvus-model

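3. Summary: Notes on Fine-Tuning Large Models

This fine-tuning run is only a starting point. Developers and users who want to build their own technical advisor still need to invest more in data quality, model selection, and training strategy. To fine-tune a better model, we recommend going deeper in two directions: first, improve the training dataset so that it covers more technical scenarios; second, try finer-grained instruction tuning to raise the model's expertise in the target domain. Technical progress always comes through continuous iteration.

Full code:

Link: https://pan.baidu.com/s/1bqWcIYBw2sUy36t9ol17Ww?pwd=1234 Access code: 1234

About the author

Zilliz Gold Writer: 尹珉 (Yin Min)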


Original link: http://xie.infoq.cn/article/f863a11b705cbf0cf853d0f27
