
An AIME and MATH-500 Evaluation Approach on Ascend NPUs Based on the MindIE Service

Author: 小顺637
  • 2025-06-05 · Shanghai

Background

There is currently no accurate, ready-to-use way to evaluate the capabilities of models with a reasoning ("think") chain of thought such as DeepSeek-R1: MindIE by itself cannot run evaluations on the datasets cited in the DeepSeek report (AIME 2024, AIME 2025, MATH-500, GPQA, and so on). Open R1, published by Hugging Face, is currently the most popular fully open-source reproduction of DeepSeek-R1. We can follow the evaluation approach of the Open R1 project and run the evaluation with lighteval.

Constraints

  • Hardware: this walkthrough uses an Atlas 800I A2; the only requirement is that the model can be successfully deployed with the MindIE service.

  • MindIE version: as above, any version on which the model deploys successfully.

  • lighteval version: this walkthrough was validated against commit ed084813e0bd12d82a06d9f913291fdbee774905; newer code may need to be verified separately.

Downloading the model weights

# Install modelscope
pip install modelscope -i https://pypi.tuna.tsinghua.edu.cn/simple/
# Download the weights with modelscope
modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --local_dir ./DeepSeek-R1-Distill-Qwen-7B

Deploying the MindIE service

Download the image

Download link: https://www.hiascend.com/developer/ascendhub/detail/af85b724a7e5469ebd7ea13c3439d48f

Start the container

# List images
docker images
# Create and start the container
sudo docker run -it --name lighteval_test \
    --network=host --shm-size=128G \
    --privileged=true \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device=/dev/devmm_svm \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /home:/home \
    -v /tmp:/tmp \
    -v /data:/data \
    -v `pwd`:/workspace \
    -w /workspace \
    swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.0.RC1-800I-A2-py311-openeuler24.03-lts /bin/bash

Environment variables

# Set the following environment variables inside the container:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/atb-models/set_env.sh

Starting the MindIE service

Modify the MindIE configuration file

cd /usr/local/Ascend/mindie/latest/mindie-service/conf
vim config.json

Modify the following items in config.json:
  • set "httpsEnabled" to false
  • set "npuDeviceIds" to [[0,1,2,3,4,5,6,7]]
  • set "modelName" to "qwen_distill_7b"
  • set "modelWeightPath" to /home/test/DeepSeek-R1-Distill-Qwen-7B
  • set "worldSize" to 8

Other recommended config.json settings:
  • maxSeqLen: the maximum input plus output length; 40960 is recommended here
  • maxInputTokenLen: set according to your actual input length; 8192 is recommended here
  • maxPrefillTokens: keep it the same as maxSeqLen
  • maxIterTimes: models with a think chain of thought produce long outputs, so set it to 32786
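
If you would rather script these edits than make them in vim, the sketch below is one way to do it. It is a hypothetical helper, not part of MindIE; it only touches the five keys listed above, and because the nesting inside config.json differs across MindIE versions it searches the file recursively. The scheduler limits above can be added to EDITS in the same way.

# patch_mindie_config.py -- hypothetical helper, not shipped with MindIE.
import json

CFG = "/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json"

# The edits described above; adjust the values for your own deployment.
EDITS = {
    "httpsEnabled": False,
    "npuDeviceIds": [[0, 1, 2, 3, 4, 5, 6, 7]],
    "modelName": "qwen_distill_7b",
    "modelWeightPath": "/home/test/DeepSeek-R1-Distill-Qwen-7B",
    "worldSize": 8,
}

def apply_edits(node):
    """Walk the nested config and overwrite every key that appears in EDITS."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in EDITS:
                node[key] = EDITS[key]
            else:
                apply_edits(value)
    elif isinstance(node, list):
        for item in node:
            apply_edits(item)

with open(CFG) as f:
    cfg = json.load(f)
apply_edits(cfg)
with open(CFG, "w") as f:
    json.dump(cfg, f, indent=4)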

Modify the model configuration file

vim /home/test/DeepSeek-R1-Distill-Qwen-7B/config.json

In this config.json, change "torch_dtype" to "float16".

Fix the model weight directory permissions

chmod -R 750 /home/test/DeepSeek-R1-Distill-Qwen-7B

Start the MindIE service

cd /usr/local/Ascend/mindie/latest/mindie-service/bin
./mindieservice_daemon


A successful start prints:


Daemon starts success!

Testing the MindIE service

curl 127.0.0.1:1025/generate -d '{
  "prompt": "What is deep learning?",
  "max_tokens": 32,
  "stream": false,
  "do_sample": true,
  "repetition_penalty": 1.00,
  "temperature": 0.01,
  "top_p": 0.001,
  "top_k": 1,
  "model": "qwen_distill_7b"
}'


Note: the port 1025 and the model name qwen_distill_7b in the curl command above must match the port and modelName settings in /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json.
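
For scripted checks, the sketch below sends the same test request with the Python requests library. It is a hypothetical helper, not part of MindIE or lighteval, and assumes requests is installed and the service is reachable on the port above.

# query_mindie.py -- hypothetical sketch of the curl test above in Python.
import requests

# Both values must agree with the MindIE service configuration in
# /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json:
PORT = 1025                     # "port"
MODEL_NAME = "qwen_distill_7b"  # "modelName"

payload = {
    "prompt": "What is deep learning?",
    "max_tokens": 32,
    "stream": False,
    "do_sample": True,
    "repetition_penalty": 1.00,
    "temperature": 0.01,
    "top_p": 0.001,
    "top_k": 1,
    "model": MODEL_NAME,
}

# POST to the /generate endpoint used in the curl test and print the raw reply.
resp = requests.post(f"http://127.0.0.1:{PORT}/generate", json=payload)
print(resp.status_code, resp.text)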

Installing lighteval

Installing the Python package (not recommended)

pip install lighteval


Installing the PyPI package may conflict with the evaluation script provided by open-r1 because of version differences, so installing from source is recommended.

Installing from source (recommended)

git clone https://github.com/huggingface/lighteval.git
cd lighteval
# Reference commit for this walkthrough: ed084813e0bd12d82a06d9f913291fdbee774905
git checkout ed084813e0bd12d82a06d9f913291fdbee774905
pip install .
pip install .[math]

Writing evaluate.py

Prepare an evaluate.py file for the custom evaluation tasks, following src/open_r1/evaluate.py from the Open R1 project:


# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Custom evaluation tasks for LightEval."""

import random

from lighteval.metrics.dynamic_metrics import (
    ExprExtractionConfig,
    IndicesExtractionConfig,
    LatexExtractionConfig,
    multilingual_extractive_match_metric,
)
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc
from lighteval.utils.language import Language


latex_gold_metric = multilingual_extractive_match_metric(
    language=Language.ENGLISH,
    fallback_mode="first_match",
    precision=5,
    gold_extraction_target=(LatexExtractionConfig(),),
    # Match boxed first before trying other regexes
    pred_extraction_target=(ExprExtractionConfig(), LatexExtractionConfig(boxed_match_priority=0)),
    aggregation_function=max,
)

expr_gold_metric = multilingual_extractive_match_metric(
    language=Language.ENGLISH,
    fallback_mode="first_match",
    precision=5,
    gold_extraction_target=(ExprExtractionConfig(),),
    # Match boxed first before trying other regexes
    pred_extraction_target=(ExprExtractionConfig(), LatexExtractionConfig(boxed_match_priority=0)),
    aggregation_function=max,
)

gpqa_metric = multilingual_extractive_match_metric(
    language=Language.ENGLISH,
    gold_extraction_target=[IndicesExtractionConfig(prefix_for_extraction="NativeLetters")],
    pred_extraction_target=[IndicesExtractionConfig(prefix_for_extraction="NativeLetters")],
    precision=5,
)


def prompt_fn(line, task_name: str = None):
    """Assumes the model is either prompted to emit \\boxed{answer} or does so automatically"""
    return Doc(
        task_name=task_name,
        query=line["problem"],
        choices=[line["solution"]],
        gold_index=0,
    )


def aime_prompt_fn(line, task_name: str = None):
    return Doc(
        task_name=task_name,
        query=line["problem"],
        choices=[line["answer"]],
        gold_index=0,
    )


def gpqa_prompt_fn(line, task_name: str = None):
    """Prompt template adapted from simple-evals: https://github.com/openai/simple-evals/blob/83ed7640a7d9cd26849bcb3340125002ef14abbe/common.py#L14"""
    gold_index = random.randint(0, 3)
    choices = [line["Incorrect Answer 1"], line["Incorrect Answer 2"], line["Incorrect Answer 3"]]
    choices.insert(gold_index, line["Correct Answer"])
    query_template = "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering.\n\n{Question}\n\nA) {A}\nB) {B}\nC) {C}\nD) {D}"
    query = query_template.format(A=choices[0], B=choices[1], C=choices[2], D=choices[3], Question=line["Question"])

    return Doc(
        task_name=task_name,
        query=query,
        choices=["A", "B", "C", "D"],
        gold_index=gold_index,
        instruction=query,
    )


# Define tasks
aime24 = LightevalTaskConfig(
    name="aime24",
    suite=["custom"],
    prompt_function=aime_prompt_fn,
    hf_repo="HuggingFaceH4/aime_2024",
    hf_subset="default",
    hf_avail_splits=["train"],
    evaluation_splits=["train"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=32768,
    metric=[expr_gold_metric],
    version=1,
)
aime25 = LightevalTaskConfig(
    name="aime25",
    suite=["custom"],
    prompt_function=aime_prompt_fn,
    hf_repo="yentinglin/aime_2025",
    hf_subset="default",
    hf_avail_splits=["train"],
    evaluation_splits=["train"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=32768,
    metric=[expr_gold_metric],
    version=1,
)
math_500 = LightevalTaskConfig(
    name="math_500",
    suite=["custom"],
    prompt_function=prompt_fn,
    hf_repo="HuggingFaceH4/MATH-500",
    hf_subset="default",
    hf_avail_splits=["test"],
    evaluation_splits=["test"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=32768,
    metric=[latex_gold_metric],
    version=1,
)
gpqa_diamond = LightevalTaskConfig(
    name="gpqa:diamond",
    suite=["custom"],
    prompt_function=gpqa_prompt_fn,
    hf_repo="Idavidrein/gpqa",
    hf_subset="gpqa_diamond",
    hf_avail_splits=["train"],
    evaluation_splits=["train"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=32768,  # needed for reasoning models like R1
    metric=[gpqa_metric],
    stop_sequence=[],  # no stop sequence, will use eos token
    trust_dataset=True,
    version=1,
)

# Add tasks to the table
TASKS_TABLE = []
TASKS_TABLE.append(aime24)
TASKS_TABLE.append(aime25)
TASKS_TABLE.append(math_500)
TASKS_TABLE.append(gpqa_diamond)

# MODULE LOGIC
if __name__ == "__main__":
    print([t["name"] for t in TASKS_TABLE])
    print(len(TASKS_TABLE))

Evaluation modes

lighteval provides several entry points for model evaluation, but the vllm and tgi backends may require code changes to run on NPU devices, which leaves the accelerate and endpoint openai modes. Because the accelerate mode suffers from truncated outputs and slow inference, we recommend the openai mode, serving the model for inference through MindIE.

Evaluating in openai mode (lighteval + the MindIE service)

lighteval can also score a model inference service through the openai client library, but the framework does not currently handle a locally hosted model service out of the box. A look at the code shows the required adaptation is small, so we use this mode to evaluate the MindIE service.

Adapting lighteval to a local inference service

Locate the lighteval installation path

Find the lighteval installation path with pip show lighteval.



Modify openai_model.py

Edit lighteval/models/endpoints/openai_model.py under the lighteval installation path:


  • Change how base_url is read so that it can be taken from an environment variable.


  • Change the model field of each request to the model name served by MindIE, here "qwen_distill_7b" (a sketch follows this list).
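
The exact edit depends on the lighteval commit, so the snippet below only sketches the idea rather than reproducing the framework's code: build the OpenAI client with a base_url taken from the environment, and send the MindIE model name with every request.

# Sketch of the two adaptations (illustrative only; the real change lives in
# lighteval/models/endpoints/openai_model.py and its request-building code).
import os

from openai import OpenAI

# 1. Read the endpoint from the environment instead of defaulting to api.openai.com,
#    so requests go to the local MindIE service.
client = OpenAI(
    base_url=os.environ["OPENAI_BASE_URL"],  # e.g. http://127.0.0.1:1025/v1
    api_key=os.environ["OPENAI_API_KEY"],    # not checked by MindIE, but must be non-empty
)

# 2. Send the model name served by MindIE instead of the local weight path.
MINDIE_MODEL_NAME = "qwen_distill_7b"  # must match "modelName" in the MindIE config.json

if __name__ == "__main__":
    completion = client.chat.completions.create(
        model=MINDIE_MODEL_NAME,
        messages=[{"role": "user", "content": "What is deep learning?"}],
        max_tokens=32,
    )
    print(completion.choices[0].message.content)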


Evaluation commands

Notes:


  • OPENAI_BASE_URL below must be set to the service address; make sure the service is reachable (you can verify it with the curl command above). OPENAI_API_KEY is not used, but it still has to be set to a non-empty value.

  • --custom-tasks takes the path to the evaluate.py file prepared above.

Set the environment variables

MODEL="/home/test/DeepSeek-R1-Distill-Qwen-7B"
MODEL_ARGS="$MODEL"
OUTPUT_DIR=./data/evals/$MODEL
export OPENAI_BASE_URL="http://127.0.0.1:1025/v1"
export OPENAI_API_KEY="test"

AIME 2024 evaluation

TASK=aime24
lighteval endpoint openai $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks evaluate.py \
    --output-dir $OUTPUT_DIR

MATH-500 evaluation

TASK=math_500
lighteval endpoint openai $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks evaluate.py \
    --output-dir $OUTPUT_DIR

Results

AIME 2024 results


MATH-500 results


DeepSeek's officially reported results

