
An AIME and MATH-500 Evaluation Approach on Ascend NPUs Based on the MindIE Service

Author: 小顺637
  • 2025-06-05 · Shanghai

Background

There is currently no accurate, ready-to-use way to evaluate the capabilities of models with a reasoning ("think") chain of thought such as DeepSeek-R1: MindIE by itself cannot run evaluations on the datasets cited in the DeepSeek report (AIME 2024, AIME 2025, MATH-500, GPQA, and so on). Open R1, published by Hugging Face, is currently the most popular fully open-source reproduction of DeepSeek-R1. We can follow the evaluation approach of the Open R1 project and run the evaluation with lighteval.

Constraints

  • Hardware: this walkthrough uses an Atlas 800I A2; the only requirement is that the model can be successfully deployed with the MindIE service.

  • MindIE version: as above, any version on which the model deploys successfully.

  • lighteval version: this walkthrough was validated against commit ed084813e0bd12d82a06d9f913291fdbee774905; newer code may need to be verified separately.

Downloading the model weights

# Install modelscope
pip install modelscope -i https://pypi.tuna.tsinghua.edu.cn/simple/
# Download the weights with modelscope
modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --local_dir ./DeepSeek-R1-Distill-Qwen-7B

Deploying the MindIE service

Download the image

Download link: https://www.hiascend.com/developer/ascendhub/detail/af85b724a7e5469ebd7ea13c3439d48f

Start the container

# List images
docker images
# Create and start the container
sudo docker run -it --name lighteval_test \
    --network=host --shm-size=128G \
    --privileged=true \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device=/dev/devmm_svm \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /home:/home \
    -v /tmp:/tmp \
    -v /data:/data \
    -v `pwd`:/workspace \
    -w /workspace \
    swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.0.RC1-800I-A2-py311-openeuler24.03-lts /bin/bash

Environment variables

# Set the following environment variables inside the container:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/atb-models/set_env.sh

Starting the MindIE service

Modify the MindIE configuration file

cd /usr/local/Ascend/mindie/latest/mindie-service/conf
vim config.json

Modify the following items in config.json:
  • set "httpsEnabled" to false
  • set "npuDeviceIds" to [[0,1,2,3,4,5,6,7]]
  • set "modelName" to "qwen_distill_7b"
  • set "modelWeightPath" to /home/test/DeepSeek-R1-Distill-Qwen-7B
  • set "worldSize" to 8

Other recommended config.json settings:
  • maxSeqLen: the maximum input plus output length; 40960 is recommended here
  • maxInputTokenLen: set according to your actual input length; 8192 is recommended here
  • maxPrefillTokens: keep it the same as maxSeqLen
  • maxIterTimes: models with a think chain of thought produce long outputs, so set it to 32786
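
If you would rather script these edits than make them in vim, the sketch below is one way to do it. It is a hypothetical helper, not part of MindIE; it only touches the five keys listed above, and because the nesting inside config.json differs across MindIE versions it searches the file recursively. The scheduler limits above can be added to EDITS in the same way.

# patch_mindie_config.py -- hypothetical helper, not shipped with MindIE.
import json

CFG = "/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json"

# The edits described above; adjust the values for your own deployment.
EDITS = {
    "httpsEnabled": False,
    "npuDeviceIds": [[0, 1, 2, 3, 4, 5, 6, 7]],
    "modelName": "qwen_distill_7b",
    "modelWeightPath": "/home/test/DeepSeek-R1-Distill-Qwen-7B",
    "worldSize": 8,
}

def apply_edits(node):
    """Walk the nested config and overwrite every key that appears in EDITS."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in EDITS:
                node[key] = EDITS[key]
            else:
                apply_edits(value)
    elif isinstance(node, list):
        for item in node:
            apply_edits(item)

with open(CFG) as f:
    cfg = json.load(f)
apply_edits(cfg)
with open(CFG, "w") as f:
    json.dump(cfg, f, indent=4)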

Modify the model configuration file

vim /home/test/DeepSeek-R1-Distill-Qwen-7B/config.json

In this config.json, change "torch_dtype" to "float16".

Fix the model weight directory permissions

chmod -R 750 /home/test/DeepSeek-R1-Distill-Qwen-7B

Start the MindIE service

cd /usr/local/Ascend/mindie/latest/mindie-service/bin
./mindieservice_daemon


A successful start prints:


Daemon starts success!

Testing the MindIE service

curl 127.0.0.1:1025/generate -d '{
  "prompt": "What is deep learning?",
  "max_tokens": 32,
  "stream": false,
  "do_sample": true,
  "repetition_penalty": 1.00,
  "temperature": 0.01,
  "top_p": 0.001,
  "top_k": 1,
  "model": "qwen_distill_7b"
}'


Note: the port 1025 and the model name qwen_distill_7b in the curl command above must match the port and modelName settings in /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json.
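
For scripted checks, the sketch below sends the same test request with the Python requests library. It is a hypothetical helper, not part of MindIE or lighteval, and assumes requests is installed and the service is reachable on the port above.

# query_mindie.py -- hypothetical sketch of the curl test above in Python.
import requests

# Both values must agree with the MindIE service configuration in
# /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json:
PORT = 1025                     # "port"
MODEL_NAME = "qwen_distill_7b"  # "modelName"

payload = {
    "prompt": "What is deep learning?",
    "max_tokens": 32,
    "stream": False,
    "do_sample": True,
    "repetition_penalty": 1.00,
    "temperature": 0.01,
    "top_p": 0.001,
    "top_k": 1,
    "model": MODEL_NAME,
}

# POST to the /generate endpoint used in the curl test and print the raw reply.
resp = requests.post(f"http://127.0.0.1:{PORT}/generate", json=payload)
print(resp.status_code, resp.text)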

Installing lighteval

Installing the Python package (not recommended)

pip install lighteval


Installing the PyPI package may conflict with the evaluation script provided by open-r1 because of version differences, so installing from source is recommended.

Installing from source (recommended)

git clone https://github.com/huggingface/lighteval.git
cd lighteval
# Reference commit for this walkthrough: ed084813e0bd12d82a06d9f913291fdbee774905
git checkout ed084813e0bd12d82a06d9f913291fdbee774905
pip install .
pip install .[math]

Writing evaluate.py

Prepare an evaluate.py file for the custom evaluation tasks, following src/open_r1/evaluate.py from the Open R1 project:


# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Custom evaluation tasks for LightEval."""

import random

from lighteval.metrics.dynamic_metrics import (
    ExprExtractionConfig,
    IndicesExtractionConfig,
    LatexExtractionConfig,
    multilingual_extractive_match_metric,
)
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc
from lighteval.utils.language import Language


latex_gold_metric = multilingual_extractive_match_metric(
    language=Language.ENGLISH,
    fallback_mode="first_match",
    precision=5,
    gold_extraction_target=(LatexExtractionConfig(),),
    # Match boxed first before trying other regexes
    pred_extraction_target=(ExprExtractionConfig(), LatexExtractionConfig(boxed_match_priority=0)),
    aggregation_function=max,
)

expr_gold_metric = multilingual_extractive_match_metric(
    language=Language.ENGLISH,
    fallback_mode="first_match",
    precision=5,
    gold_extraction_target=(ExprExtractionConfig(),),
    # Match boxed first before trying other regexes
    pred_extraction_target=(ExprExtractionConfig(), LatexExtractionConfig(boxed_match_priority=0)),
    aggregation_function=max,
)

gpqa_metric = multilingual_extractive_match_metric(
    language=Language.ENGLISH,
    gold_extraction_target=[IndicesExtractionConfig(prefix_for_extraction="NativeLetters")],
    pred_extraction_target=[IndicesExtractionConfig(prefix_for_extraction="NativeLetters")],
    precision=5,
)


def prompt_fn(line, task_name: str = None):
    """Assumes the model is either prompted to emit \\boxed{answer} or does so automatically"""
    return Doc(
        task_name=task_name,
        query=line["problem"],
        choices=[line["solution"]],
        gold_index=0,
    )


def aime_prompt_fn(line, task_name: str = None):
    return Doc(
        task_name=task_name,
        query=line["problem"],
        choices=[line["answer"]],
        gold_index=0,
    )


def gpqa_prompt_fn(line, task_name: str = None):
    """Prompt template adapted from simple-evals: https://github.com/openai/simple-evals/blob/83ed7640a7d9cd26849bcb3340125002ef14abbe/common.py#L14"""
    gold_index = random.randint(0, 3)
    choices = [line["Incorrect Answer 1"], line["Incorrect Answer 2"], line["Incorrect Answer 3"]]
    choices.insert(gold_index, line["Correct Answer"])
    query_template = "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering.\n\n{Question}\n\nA) {A}\nB) {B}\nC) {C}\nD) {D}"
    query = query_template.format(A=choices[0], B=choices[1], C=choices[2], D=choices[3], Question=line["Question"])

    return Doc(
        task_name=task_name,
        query=query,
        choices=["A", "B", "C", "D"],
        gold_index=gold_index,
        instruction=query,
    )


# Define tasks
aime24 = LightevalTaskConfig(
    name="aime24",
    suite=["custom"],
    prompt_function=aime_prompt_fn,
    hf_repo="HuggingFaceH4/aime_2024",
    hf_subset="default",
    hf_avail_splits=["train"],
    evaluation_splits=["train"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=32768,
    metric=[expr_gold_metric],
    version=1,
)
aime25 = LightevalTaskConfig(
    name="aime25",
    suite=["custom"],
    prompt_function=aime_prompt_fn,
    hf_repo="yentinglin/aime_2025",
    hf_subset="default",
    hf_avail_splits=["train"],
    evaluation_splits=["train"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=32768,
    metric=[expr_gold_metric],
    version=1,
)
math_500 = LightevalTaskConfig(
    name="math_500",
    suite=["custom"],
    prompt_function=prompt_fn,
    hf_repo="HuggingFaceH4/MATH-500",
    hf_subset="default",
    hf_avail_splits=["test"],
    evaluation_splits=["test"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=32768,
    metric=[latex_gold_metric],
    version=1,
)
gpqa_diamond = LightevalTaskConfig(
    name="gpqa:diamond",
    suite=["custom"],
    prompt_function=gpqa_prompt_fn,
    hf_repo="Idavidrein/gpqa",
    hf_subset="gpqa_diamond",
    hf_avail_splits=["train"],
    evaluation_splits=["train"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=32768,  # needed for reasoning models like R1
    metric=[gpqa_metric],
    stop_sequence=[],  # no stop sequence, will use eos token
    trust_dataset=True,
    version=1,
)

# Add tasks to the table
TASKS_TABLE = []
TASKS_TABLE.append(aime24)
TASKS_TABLE.append(aime25)
TASKS_TABLE.append(math_500)
TASKS_TABLE.append(gpqa_diamond)

# MODULE LOGIC
if __name__ == "__main__":
    print([t["name"] for t in TASKS_TABLE])
    print(len(TASKS_TABLE))

Evaluation modes

lighteval provides several entry points for model evaluation, but the vllm and tgi backends may require code changes to run on NPU devices, which leaves the accelerate and endpoint openai modes. Because the accelerate mode suffers from truncated outputs and slow inference, we recommend the openai mode, serving the model for inference through MindIE.

Evaluating in openai mode (lighteval + the MindIE service)

lighteval can also score a model inference service through the openai client library, but the framework does not currently handle a locally hosted model service out of the box. A look at the code shows the required adaptation is small, so we use this mode to evaluate the MindIE service.

Adapting lighteval to a local inference service

Locate the lighteval installation path

Find the lighteval installation path with pip show lighteval.



Modify openai_model.py

Edit lighteval/models/endpoints/openai_model.py under the lighteval installation path:


  • Change how base_url is read so that it can be taken from an environment variable.


  • Change the model field of each request to the model name served by MindIE, here "qwen_distill_7b" (a sketch follows this list).
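
The exact edit depends on the lighteval commit, so the snippet below only sketches the idea rather than reproducing the framework's code: build the OpenAI client with a base_url taken from the environment, and send the MindIE model name with every request.

# Sketch of the two adaptations (illustrative only; the real change lives in
# lighteval/models/endpoints/openai_model.py and its request-building code).
import os

from openai import OpenAI

# 1. Read the endpoint from the environment instead of defaulting to api.openai.com,
#    so requests go to the local MindIE service.
client = OpenAI(
    base_url=os.environ["OPENAI_BASE_URL"],  # e.g. http://127.0.0.1:1025/v1
    api_key=os.environ["OPENAI_API_KEY"],    # not checked by MindIE, but must be non-empty
)

# 2. Send the model name served by MindIE instead of the local weight path.
MINDIE_MODEL_NAME = "qwen_distill_7b"  # must match "modelName" in the MindIE config.json

if __name__ == "__main__":
    completion = client.chat.completions.create(
        model=MINDIE_MODEL_NAME,
        messages=[{"role": "user", "content": "What is deep learning?"}],
        max_tokens=32,
    )
    print(completion.choices[0].message.content)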


Evaluation commands

Notes:


  • OPENAI_BASE_URL below must be set to the service address; make sure the service is reachable (you can verify it with the curl command above). OPENAI_API_KEY is not used, but it still has to be set to a non-empty value.

  • --custom-tasks takes the path to the evaluate.py file prepared above.

Set the environment variables

MODEL="/home/test/DeepSeek-R1-Distill-Qwen-7B"
MODEL_ARGS="$MODEL"
OUTPUT_DIR=./data/evals/$MODEL
export OPENAI_BASE_URL="http://127.0.0.1:1025/v1"
export OPENAI_API_KEY="test"

AIME 2024 evaluation

TASK=aime24
lighteval endpoint openai $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks evaluate.py \
    --output-dir $OUTPUT_DIR

MATH-500 evaluation

TASK=math_500
lighteval endpoint openai $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks evaluate.py \
    --output-dir $OUTPUT_DIR

Results

AIME 2024 results


MATH-500 results


DeepSeek's officially reported results

