
DeepSeek Model Quantization

2025-02-21

Technical Background


Large language models (LLMs) can be quantized to cut memory/VRAM usage and reduce communication overhead, which in turn speeds up inference. The most common approach converts Float16 weights into low-precision integers such as Int4. In the extreme case the parameters can even be reduced to binary values, i.e. only 0 and 1, but quantization that aggressive is likely to hurt inference quality. A common rule of thumb is Q8 for models under 70B parameters and Q4 for models at 70B and above. The underlying theory, including symmetric and asymmetric quantization, is not covered here; this article focuses on the engineering workflow, using llama.cpp to carry out the quantization.
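
To make the memory savings concrete, here is the rough arithmetic for a 32B-parameter model, ignoring the small overhead of per-block scale factors and metadata (an approximation added here for illustration):

$$
\begin{aligned}
\text{FP16:}\quad & 32\times 10^{9}\ \text{params}\times 2\ \text{bytes} \approx 64\ \text{GB}\\
\text{Q8:}\quad   & 32\times 10^{9}\ \text{params}\times 1\ \text{byte}  \approx 32\ \text{GB}\\
\text{Q4:}\quad   & 32\times 10^{9}\ \text{params}\times 0.5\ \text{bytes} \approx 16\ \text{GB}
\end{aligned}
$$

These estimates line up with the actual file sizes measured later in this article (roughly 65.5 GB, 34.8 GB and 18.6 GB); the extra few gigabytes come mainly from the per-block scale factors that the quantized formats store alongside the integer weights.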


Installing llama.cpp


Here we build llama.cpp from source on Ubuntu. First, clone it from GitHub:


$ git clone https://github.com/ggerganov/llama.cpp.git
Cloning into 'llama.cpp'...
remote: Enumerating objects: 43657, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 43657 (delta 3), reused 5 (delta 1), pack-reused 43642 (from 3)
Receiving objects: 100% (43657/43657), 88.26 MiB | 8.30 MiB/s, done.
Resolving deltas: 100% (31409/31409), done.


It is best to create a virtual environment to avoid dependency conflicts; Python 3.10 is recommended:


# Create the virtual environment
$ conda create -n llama python=3.10
# Activate the virtual environment
$ conda activate llama


Enter the cloned llama.cpp directory and install all of the Python dependencies:


$ cd llama.cpp/
$ python3 -m pip install -e .


Create a build directory and run the build commands:


$ mkdir build
$ cd build/
$ cmake ..
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.25.1")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- Configuring done
-- Generating done
-- Build files have been written to: /datb/DeepSeek/llama/llama.cpp/build
$ cmake --build . --config Release
Scanning dependencies of target ggml-base
[  0%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o
[  1%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o
[100%] Linking CXX executable ../../bin/llama-vdot
[100%] Built target llama-vdot


At this point the CPU version of llama.cpp has been built successfully and can be used directly. If you need the GPU-accelerated build, refer to the next section; if that sounds like too much trouble, just skip it.
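
To confirm what was built, the compiled executables are collected under build/bin/. The two that matter for this article are llama-quantize (used for quantization later) and llama-cli (handy for quick local inference tests):

$ ls bin/ | grep -E 'llama-(cli|quantize)'
llama-cli
llama-quantize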


llama.cpp with CUDA Acceleration


Building the GPU version of llama.cpp requires a few extra dependencies first:


$ sudo apt install curl libcurl4-openssl-dev


The main difference from the CPU build is the cmake configure command (if you have already built the CPU version, it is best to clear out the files in the build directory first):


$ cmake .. -DCMAKE_CUDA_COMPILER=/usr/bin/nvcc -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=17
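
Before running the configure step above, it can help to check which nvcc CMake will pick up, since several of the errors below ultimately come down to the CUDA toolkit's location or version:

$ which nvcc
$ nvcc --version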


The extra flag -DCMAKE_CUDA_STANDARD=17 works around an issue reported in the llama.cpp repository; without it, you may see an error like this:


CMake Error in ggml/src/ggml-cuda/CMakeLists.txt:
  Target "ggml-cuda" requires the language dialect "CUDA17" (with compiler
  extensions), but CMake does not know the compile flags to use to enable it.


If everything goes smoothly, run the following command; if it compiles cleanly, you are done:


$ cmake --build . --config Release


But if you hit error messages like I did, they need to be handled one by one.


/datb/DeepSeek/llama/llama.cpp/ggml/src/ggml-cuda/vendors/cuda.h:6:10: fatal error: cuda_bf16.h: No such file or directory
 #include <cuda_bf16.h>
          ^~~~~~~~~~~~~
compilation terminated.


This error means the header file cannot be found. Running find / -name cuda_bf16.h in the environment shows that the header does in fact exist:


/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/cuda_bf16.h
/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/triton/backends/nvidia/include/cuda_bf16.h


The fix is to add this path to CPATH:


$ export CPATH=$CPATH:/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/


If you see this error instead:


/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/cuda_fp16.h:4100:10: fatal error: nv/target: No such file or directory
 #include <nv/target>
          ^~~~~~~~~~~
compilation terminated.


then the nv/target directory cannot be found. If a suitable target path exists locally, it can likewise be added to CPATH:


$ export CPATH=/home/dechin/anaconda3/pkgs/cupy-core-13.3.0-py310h5da974a_2/lib/python3.10/site-packages/cupy/_core/include/cupy/_cccl/libcudacxx/:$CPATH


If you get the following errors:


/datb/DeepSeek/llama/llama.cpp/ggml/src/ggml-cuda/common.cuh(138): error: identifier "cublasGetStatusString" is undefined
/datb/DeepSeek/llama/llama.cpp/ggml/src/ggml-cuda/common.cuh(417): error: A __device__ variable cannot be marked constexpr
/datb/DeepSeek/llama/llama.cpp/ggml/src/ggml-cuda/common.cuh(745): error: identifier "CUBLAS_TF32_TENSOR_OP_MATH" is undefined
3 errors detected in the compilation of "/tmp/tmpxft_000a126f_00000000-9_acc.compute_75.cpp1.ii".
make[2]: *** [ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/build.make:82:ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/acc.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:1964:ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/all] Error 2
make: *** [Makefile:160:all] Error 2


then it is most likely a cuda-toolkit version problem; try installing CUDA 12:


$ conda install nvidia::cuda-toolkit


If the conda installation itself fails like this:


Collecting package metadata (current_repodata.json): failed
# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<
Traceback (most recent call last):
  File "/home/dechin/anaconda3/lib/python3.8/site-packages/conda/gateways/repodata/__init__.py", line 132, in conda_http_errors
    yield
  File "/home/dechin/anaconda3/lib/python3.8/site-packages/conda/gateways/repodata/__init__.py", line 101, in repodata
    response.raise_for_status()
  File "/home/dechin/anaconda3/lib/python3.8/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://conda.anaconda.org/defaults/linux-64/current_repodata.json


that is a conda channel problem. Remove the old channels and fall back to the default channels, or configure a mirror that is reachable from your region:


$ conda config --remove-key channels
$ conda config --remove-key default_channels
$ conda config --append channels conda-forge


After reinstalling, the nvcc path changes, so remember to update the -DCMAKE_CUDA_COMPILER argument when configuring:


$ cmake .. -DCMAKE_CUDA_COMPILER=/home/dechin/anaconda3/envs/llama/bin/nvcc -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=17


If the following error appears:


-- Unable to find cuda_runtime.h in "/home/dechin/anaconda3/envs/llama/include" for CUDAToolkit_INCLUDE_DIR.
-- Could NOT find CUDAToolkit (missing: CUDAToolkit_INCLUDE_DIR)
CMake Error at ggml/src/ggml-cuda/CMakeLists.txt:151 (message):
  CUDA Toolkit not found

-- Configuring incomplete, errors occurred!
See also "/datb/DeepSeek/llama/llama.cpp/build/CMakeFiles/CMakeOutput.log".
See also "/datb/DeepSeek/llama/llama.cpp/build/CMakeFiles/CMakeError.log".


it means the CUDAToolkit_INCLUDE_DIR path is not configured; just add an include path to the cmake command:


$ cmake .. -DCMAKE_CUDA_COMPILER=/home/dechin/anaconda3/envs/llama/bin/nvcc -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=17 -DCUDAToolkit_INCLUDE_DIR=/home/dechin/anaconda3/envs/llama/targets/x86_64-linux/include/ -DCURL_LIBRARY=/usr/lib/x86_64-linux-gnu/


If after all of the above you still get errors, I suggest just using Docker, or running the quantize step with the CPU build and serving the model with Ollama, which is simpler.


Downloading the Hugging Face Model


Many GGUF files that have already been quantized cannot be cleanly re-quantized (llama-quantize only allows it with --allow-requantize and warns that quality can suffer badly), so it is better to download the original safetensors model files from Hugging Face, convert the HF model to GGUF with a Python script shipped with llama.cpp, and then quantize it with llama.cpp.


As for the download itself, access to Hugging Face is sometimes restricted, so the first recommendation here is the domestic ModelScope platform. To download models from ModelScope, install the modelscope Python package:


$ python3 -m pip install modelscope
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: modelscope in /home/dechin/anaconda3/lib/python3.8/site-packages (1.22.3)
Requirement already satisfied: requests>=2.25 in /home/dechin/.local/lib/python3.8/site-packages (from modelscope) (2.25.1)
Requirement already satisfied: urllib3>=1.26 in /home/dechin/.local/lib/python3.8/site-packages (from modelscope) (1.26.5)
Requirement already satisfied: tqdm>=4.64.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from modelscope) (4.67.1)
Requirement already satisfied: certifi>=2017.4.17 in /home/dechin/.local/lib/python3.8/site-packages (from requests>=2.25->modelscope) (2021.5.30)
Requirement already satisfied: chardet<5,>=3.0.2 in /home/dechin/.local/lib/python3.8/site-packages (from requests>=2.25->modelscope) (4.0.0)
Requirement already satisfied: idna<3,>=2.5 in /home/dechin/.local/lib/python3.8/site-packages (from requests>=2.25->modelscope) (2.10)


Then download the model with modelscope:


$ modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
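
By default, modelscope downloads into its cache directory (typically ~/.cache/modelscope/hub). If your modelscope version supports the --local_dir option, the files can be placed somewhere easier to find; the target path below is only an example:

$ modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --local_dir /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B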


If the download fails with an error like this (if it does not, just wait for the download to finish):


safetensors integrity check failed, expected sha256 signature is xxx


you can try a different download method. First install Git LFS:


$ sudo apt install git-lfs


Then clone the model repository:


$ git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B.git
Cloning into 'DeepSeek-R1-Distill-Qwen-32B'...
remote: Enumerating objects: 52, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 52 (delta 17), reused 42 (delta 13), pack-reused 0
Unpacking objects: 100% (52/52), 2.27 MiB | 2.62 MiB/s, done.
Filtering content: 100% (8/8), 5.02 GiB | 912.00 KiB/s, done.
Encountered 8 file(s) that may not have been copied correctly on Windows:
	model-00005-of-000008.safetensors
	model-00004-of-000008.safetensors
	model-00008-of-000008.safetensors
	model-00002-of-000008.safetensors
	model-00007-of-000008.safetensors
	model-00003-of-000008.safetensors
	model-00006-of-000008.safetensors
	model-00001-of-000008.safetensors
See: `git lfs help smudge` for more details.


This process takes quite a while; be patient until the download completes. Once it is done, check the directory:


$ cd DeepSeek-R1-Distill-Qwen-32B/
$ ll
total 63999072
drwxrwxr-x 4 dechin dechin       4096 Feb 12 19:22 ./
drwxrwxr-x 3 dechin dechin       4096 Feb 12 17:46 ../
-rw-rw-r-- 1 dechin dechin        664 Feb 12 17:46 config.json
-rw-rw-r-- 1 dechin dechin         73 Feb 12 17:46 configuration.json
drwxrwxr-x 2 dechin dechin       4096 Feb 12 17:46 figures/
-rw-rw-r-- 1 dechin dechin        181 Feb 12 17:46 generation_config.json
drwxrwxr-x 9 dechin dechin       4096 Feb 12 19:22 .git/
-rw-rw-r-- 1 dechin dechin       1519 Feb 12 17:46 .gitattributes
-rw-rw-r-- 1 dechin dechin       1064 Feb 12 17:46 LICENSE
-rw-rw-r-- 1 dechin dechin 8792578462 Feb 12 19:22 model-00001-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906899 Feb 12 19:03 model-00002-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906927 Feb 12 19:18 model-00003-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906927 Feb 12 18:56 model-00004-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906927 Feb 12 18:38 model-00005-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906927 Feb 12 19:19 model-00006-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906927 Feb 12 19:15 model-00007-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 4073821536 Feb 12 19:02 model-00008-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin      64018 Feb 12 17:46 model.safetensors.index.json
-rw-rw-r-- 1 dechin dechin      18985 Feb 12 17:46 README.md
-rw-rw-r-- 1 dechin dechin       3071 Feb 12 17:46 tokenizer_config.json
-rw-rw-r-- 1 dechin dechin    7031660 Feb 12 17:46 tokenizer.json


That means the download succeeded.
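
Given the "may not have been copied correctly on Windows" warning in the clone output above, it does not hurt to let Git LFS verify the downloaded objects before converting anything (run inside the model directory; it should report that the objects check out):

$ git lfs fsck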


Converting the HF Model to GGUF


Find the Python conversion script in the llama.cpp source directory built earlier (llama/llama.cpp/ here), and first take a look at its usage:


$ python3 convert_hf_to_gguf.py --help
usage: convert_hf_to_gguf.py [-h] [--vocab-only] [--outfile OUTFILE] [--outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}] [--bigendian] [--use-temp-file] [--no-lazy]
                             [--model-name MODEL_NAME] [--verbose] [--split-max-tensors SPLIT_MAX_TENSORS] [--split-max-size SPLIT_MAX_SIZE] [--dry-run]
                             [--no-tensor-first-split] [--metadata METADATA] [--print-supported-models]
                             [model]

Convert a huggingface model to a GGML compatible file

positional arguments:
  model                 directory containing model file

options:
  -h, --help            show this help message and exit
  --vocab-only          extract only the vocab
  --outfile OUTFILE     path to write to; default: based on input. {ftype} will be replaced by the outtype.
  --outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}
                        output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, tq1_0 or tq2_0 for ternary, and auto for the highest-
                        fidelity 16-bit float type depending on the first loaded tensor type
  --bigendian           model is executed on big endian machine
  --use-temp-file       use the tempfile library while processing (helpful when running out of memory, process killed)
  --no-lazy             use more RAM by computing all outputs before writing (use in case lazy evaluation is broken)
  --model-name MODEL_NAME
                        name of the model
  --verbose             increase output verbosity
  --split-max-tensors SPLIT_MAX_TENSORS
                        max tensors in each split
  --split-max-size SPLIT_MAX_SIZE
                        max size per split N(M|G)
  --dry-run             only print out a split plan and exit, without writing any new files
  --no-tensor-first-split
                        do not add tensors to the first split (disabled by default)
  --metadata METADATA   Specify the path for an authorship metadata override file
  --print-supported-models
                        Print the supported models


Then run the conversion to build the GGUF file:


$ python3 convert_hf_to_gguf.py /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B --outfile /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B.gguf
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B.gguf: n_tensors = 771, total_size = 65.5G
Writing: 100%|██████████████████████████████████████████████████████████████| 65.5G/65.5G [19:42<00:00, 55.4Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B.gguf


After the conversion completes, a single all-in-one .gguf file is created at the specified path. With the default --outtype auto the weights are written as 16-bit floats (hence the roughly 65.5 GB file, about 2 bytes per parameter); this file is the input for the quantization step that follows.


GGUF Model Quantization


In the build/bin/ directory of the compiled llama.cpp you can find the quantization executable:


$ ./llama-quantize --help
usage: ./llama-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type]
       [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]

  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type
  --imatrix file_name: use data in file_name as importance matrix for quant optimizations
  --include-weights tensor_name: use importance matrix for this/these tensor(s)
  --exclude-weights tensor_name: use importance matrix for this/these tensor(s)
  --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
  --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
  --keep-split: will generate quantized model in the same shards as input
  --override-kv KEY=TYPE:VALUE
      Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together

Allowed quantization types:
   2  or  Q4_0    :  4.34G, +0.4685 ppl @ Llama-3-8B
   3  or  Q4_1    :  4.78G, +0.4511 ppl @ Llama-3-8B
   8  or  Q5_0    :  5.21G, +0.1316 ppl @ Llama-3-8B
   9  or  Q5_1    :  5.65G, +0.1062 ppl @ Llama-3-8B
  19  or  IQ2_XXS :  2.06 bpw quantization
  20  or  IQ2_XS  :  2.31 bpw quantization
  28  or  IQ2_S   :  2.5  bpw quantization
  29  or  IQ2_M   :  2.7  bpw quantization
  24  or  IQ1_S   :  1.56 bpw quantization
  31  or  IQ1_M   :  1.75 bpw quantization
  36  or  TQ1_0   :  1.69 bpw ternarization
  37  or  TQ2_0   :  2.06 bpw ternarization
  10  or  Q2_K    :  2.96G, +3.5199 ppl @ Llama-3-8B
  21  or  Q2_K_S  :  2.96G, +3.1836 ppl @ Llama-3-8B
  23  or  IQ3_XXS :  3.06 bpw quantization
  26  or  IQ3_S   :  3.44 bpw quantization
  27  or  IQ3_M   :  3.66 bpw quantization mix
  12  or  Q3_K    : alias for Q3_K_M
  22  or  IQ3_XS  :  3.3 bpw quantization
  11  or  Q3_K_S  :  3.41G, +1.6321 ppl @ Llama-3-8B
  12  or  Q3_K_M  :  3.74G, +0.6569 ppl @ Llama-3-8B
  13  or  Q3_K_L  :  4.03G, +0.5562 ppl @ Llama-3-8B
  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
  15  or  Q4_K    : alias for Q4_K_M
  14  or  Q4_K_S  :  4.37G, +0.2689 ppl @ Llama-3-8B
  15  or  Q4_K_M  :  4.58G, +0.1754 ppl @ Llama-3-8B
  17  or  Q5_K    : alias for Q5_K_M
  16  or  Q5_K_S  :  5.21G, +0.1049 ppl @ Llama-3-8B
  17  or  Q5_K_M  :  5.33G, +0.0569 ppl @ Llama-3-8B
  18  or  Q6_K    :  6.14G, +0.0217 ppl @ Llama-3-8B
   7  or  Q8_0    :  7.96G, +0.0026 ppl @ Llama-3-8B
   1  or  F16     : 14.00G, +0.0020 ppl @ Mistral-7B
  32  or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
   0  or  F32     : 26.00G              @ 7B
          COPY    : only copy tensors, no quantizing


This shows the full list of precisions that can be produced. For example, we can quantize the 32B model to q4_0:


$ ./llama-quantize /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B.gguf /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf q4_0
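
Before importing it anywhere, the quantized file can be smoke-tested directly with the llama-cli binary built earlier (a minimal check: -m points at the model, -p supplies an arbitrary prompt, and -n caps the number of generated tokens):

$ ./llama-cli -m /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf -p "Hello, please introduce yourself." -n 64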


Comparing the resulting files (the Q8_0 file here was downloaded directly from a model repository, i.e. quantized by someone else):


-rw-rw-r-- 1 dechin dechin 65535969184 Feb 13 09:33 DeepSeek-R1-Distill-Qwen-32B.gguf
-rw-rw-r-- 1 dechin dechin 18640230304 Feb 13 09:51 DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf
-rw-rw-r-- 1 dechin dechin 34820884384 Feb  9 01:44 DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf


Going from the unquantized 16-bit GGUF to Q8 and then Q4, the drop in size, and hence in memory footprint, is obvious. Choose the quantization precision according to the computing resources available on your own machine.


Once quantization is done and the model has been imported into Ollama (a minimal import sketch follows below), ollama list shows all local models:

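For reference, importing the quantized GGUF into Ollama only needs a minimal Modelfile whose FROM line points at the file; this is a sketch, with the path and tag following the naming used above (adjust them to your own setup):

$ cat > Modelfile <<'EOF'
FROM /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf
EOF
$ ollama create deepseek-r1:32b-q40 -f Modelfile
$ ollama run deepseek-r1:32b-q40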

$ ollama list
NAME                            ID              SIZE      MODIFIED
deepseek-r1:32b-q2k             8d2a0c19f6e0    12 GB     5 seconds ago
deepseek-r1:32b-q40             13c7c287f615    18 GB     3 minutes ago
deepseek-r1:32b                 91f2de3dd7fd    34 GB     42 hours ago
nomic-embed-text-v1.5:latest    5b3683392ccb    274 MB    43 hours ago
deepseek-r1:14b                 ea35dfe18182    9.0 GB    7 days ago


The q2k entry there is also a locally quantized model, at Q2_K precision. Going from Q4_0 down to Q2_K no longer brings much reduction in memory, though, which is why many people stop at Q4_0: it balances performance and accuracy.
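
For reference, that Q2_K file was produced with the same llama-quantize call as before, only with a different type argument (a sketch mirroring the Q4_0 command above):

$ ./llama-quantize /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B.gguf /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q2_K.gguf Q2_K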


Handling Other Errors


If running the llama-quantize executable produces this error:


./xxx/llama-quantize: error while loading shared libraries: libllama.so: cannot open shared object file: No such file or directory


it means the dynamic library search path LD_LIBRARY_PATH has not been set. Alternatively, you can simply cd into the bin/ directory and run the executable from there.
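
For reference, a sketch of the environment-variable fix: the directory that contains libllama.so depends on the build configuration, so locate it first; the export below assumes it landed under the build directory used earlier.

# Locate the shared library first
$ find /datb/DeepSeek/llama/llama.cpp/build -name 'libllama.so'
# Then add the directory it reports to the search path, for example:
$ export LD_LIBRARY_PATH=/datb/DeepSeek/llama/llama.cpp/build/bin:$LD_LIBRARY_PATH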


Summary


This article mainly introduced llama.cpp as a tool for working with large models. Since Ollama is already used here to run the models, only two uses of llama.cpp were covered: converting HF models to GGUF, and quantizing large models. Parameter quantization makes it possible to run DeepSeek's distilled models locally, even on a limited hardware budget.


Reposted from: Dechin's blog

Original link: https://www.cnblogs.com/dechinphy/p/18711084/quantize

