LMCache - Redis for LLMs: An Unbounded, High-Speed KV Cache System

Author: qife

2025-06-30
Fujian

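Project Title and Description

LMCache is an extension for LLM serving engines that aims to reduce time to first token (TTFT) and increase throughput, particularly in long-context scenarios. It caches the KV caches of reusable text across several storage tiers (GPU memory, CPU memory, and local disk), so that any serving instance can reuse repeated text wherever it appears in the prompt, not only as a prefix. This saves GPU compute cycles and lowers user-facing response latency.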

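Features

Cross-instance KV cache sharing: KV caches can be shared and reused across different serving instances.
Multi-tier storage: automatically manages a multi-level cache spanning GPU memory, CPU memory, and local disk.
Deep vLLM integration: works together with vLLM and is intended to run out of the box.
Blended attention computation: supports the CacheBlend technique for efficient blended attention computation.
Distributed caching: supports distributed cache lookup and management through Redis.
Multi-model support: tested with models such as Llama 3.1 8B and DeepSeek V2 Lite.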
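To make the multi-tier idea concrete, here is a minimal sketch of a tiered cache in plain Python. It is not LMCache's implementation: the class name TieredKVCache, the LRU eviction policy, and the promote-on-hit behavior are all assumptions chosen for illustration, and the "KV cache" values are just opaque bytes.

```python
from collections import OrderedDict
from typing import Optional


class TieredKVCache:
    """Toy multi-tier cache (hypothetical, for illustration only)."""

    def __init__(self, capacities: dict[str, int]):
        # Tiers are ordered fastest to slowest, e.g. gpu -> cpu -> disk.
        self.capacities = capacities
        self.tiers = {name: OrderedDict() for name in capacities}

    def put(self, key: str, value: bytes, tier_index: int = 0) -> None:
        names = list(self.tiers)
        if tier_index >= len(names):
            return  # Fell off the slowest tier: the entry is dropped.
        name = names[tier_index]
        tier = self.tiers[name]
        tier[key] = value
        tier.move_to_end(key)  # Mark as most recently used.
        if len(tier) > self.capacities[name]:
            # Evict the least recently used entry and demote it one tier down.
            old_key, old_value = tier.popitem(last=False)
            self.put(old_key, old_value, tier_index + 1)

    def get(self, key: str) -> Optional[tuple[str, bytes]]:
        # Search the fastest tier first; report which tier served the hit.
        for name, tier in self.tiers.items():
            if key in tier:
                value = tier.pop(key)
                self.put(key, value)  # Promote the hit to the fastest tier.
                return name, value
        return None


if __name__ == "__main__":
    cache = TieredKVCache({"gpu": 1, "cpu": 1, "disk": 1})
    for chunk_key in ("a", "b", "c"):
        cache.put(chunk_key, chunk_key.encode())
    print(cache.get("a"))  # Served from a slower tier, then promoted.
    print(cache.get("a"))  # Served from the fastest tier.
```

A real system would move tensors between devices rather than Python objects, and would likely write to slower tiers asynchronously; the sketch only shows the lookup order and the demote/promote flow.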

Installation Guide

Prerequisites

Python 3.10+
CUDA 12.1+
PyTorch 2.0+
vLLM 0.3.0+

Installation Steps

Use the prebuilt Docker image (recommended):

docker pull lmcache/vllm-openai:latest-nightly


Install from source:

pip install -r requirements.txt
pip install -e .


Install the code-quality tools:

pip install -r requirements/lint.txt
pre-commit install


Usage

Basic Usage

Start a server with the vLLM integration:

LMCACHE_CONFIG_FILE=config.yaml python -m lmcache_vllm.vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2


Benchmarking

Run the multi-round QA benchmark:

python3 multi-round-qa.py \
    --num-users 10 \
    --num-rounds 5 \
    --qps 0.5 \
    --shared-system-prompt 1000 \
    --user-history-prompt 2000 \
    --answer-len 100 \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --base-url http://localhost:8000/v1

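Since TTFT is the metric LMCache targets, it can help to see how it is measured on the client side. The sketch below times the arrival of the first piece of a streamed response. The names measure_ttft and fake_stream are hypothetical, and fake_stream is only a stand-in for a real streaming response from the server, so the numbers it produces say nothing about LMCache itself.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float, str]:
    """Return (ttft_seconds, total_seconds, full_text) for a stream of text pieces."""
    start = time.perf_counter()
    ttft = None
    pieces = []
    for piece in stream:
        if ttft is None:
            # The first piece has arrived: this is the time to first token.
            ttft = time.perf_counter() - start
        pieces.append(piece)
    total = time.perf_counter() - start
    if ttft is None:
        raise ValueError("stream produced no output")
    return ttft, total, "".join(pieces)


def fake_stream(first_delay: float, pieces: list[str]) -> Iterator[str]:
    """Stand-in for a streaming response (assumption, not a real client)."""
    time.sleep(first_delay)  # Simulated prefill time before the first token.
    for piece in pieces:
        yield piece


if __name__ == "__main__":
    ttft, total, text = measure_ttft(fake_stream(0.05, ["Par", "is"]))
    print(f"TTFT={ttft:.3f}s total={total:.3f}s text={text!r}")
```

With a real server, the iterable passed to measure_ttft would be the text pieces of a streamed completion. Comparing TTFT for the first and second request that share a long prompt is one simple way to see whether cached KV data is being reused.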

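API Examples

Use the OpenAI-compatible API. The example uses the openai>=1.0 client (the older openai.Completion.create call was removed in that release) and assumes the server started above is listening on http://localhost:8000/v1:

from openai import OpenAI

# Point the client at the local OpenAI-compatible server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    prompt="The capital of France is\nA. Berlin\nB. Madrid\nC. Paris\nD. Rome\nAnswer:",
    temperature=0,
    max_tokens=3,
)
print(response.choices[0].text)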


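Core Code

Cache Engine Initialization

from typing import Optional

# ModelConfig, LMCacheEngine, LMCacheEngineMetadata, LMCacheEngineBuilder,
# lmcache_get_config, get_kv_cache_torch_dtype, ENGINE_NAME, and gpu_connector
# are defined elsewhere in the project and are not shown in this excerpt.


def init_lmcache_engine(
    model_config: ModelConfig,
    tp_size: int,
    rank: int,
    world_size: int,
) -> Optional[LMCacheEngine]:
    config = lmcache_get_config()
    kv_dtype = get_kv_cache_torch_dtype(model_config.dtype)

    # Build the KV shape (used to size the memory pool).
    # The 2 covers the K and V tensors of each layer.
    num_layer = model_config.num_hidden_layers
    chunk_size = config.chunk_size
    num_kv_head = model_config.get_num_kv_heads(tp_size)
    head_dim = model_config.head_dim
    kv_shape = (num_layer, 2, chunk_size, num_kv_head, head_dim)

    metadata = LMCacheEngineMetadata(
        model_config.model_path,
        world_size,
        rank,
        "sgl",
        kv_dtype,
        kv_shape,
    )

    # Reuse the engine registered under ENGINE_NAME, or create it on first use.
    engine = LMCacheEngineBuilder.get_or_create(
        ENGINE_NAME, config, metadata, gpu_connector
    )
    return engine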

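To get a feel for how much memory one cached chunk occupies, the kv_shape above can be turned into a byte count. The sketch below is only arithmetic. The concrete numbers are assumptions rather than values taken from LMCache: a Llama 3.1 8B style model (32 layers, 8 KV heads, head dimension 128), a hypothetical chunk_size of 256 tokens, tp_size of 1, and a 2-byte dtype such as float16.

```python
def kv_chunk_bytes(
    num_layer: int,
    chunk_size: int,
    num_kv_head: int,
    head_dim: int,
    dtype_bytes: int,
) -> int:
    """Bytes needed for one chunk shaped (num_layer, 2, chunk_size, num_kv_head, head_dim)."""
    # The factor 2 accounts for storing both K and V.
    return num_layer * 2 * chunk_size * num_kv_head * head_dim * dtype_bytes


if __name__ == "__main__":
    # Assumed example values (see the paragraph above); adjust for your model.
    size = kv_chunk_bytes(
        num_layer=32, chunk_size=256, num_kv_head=8, head_dim=128, dtype_bytes=2
    )
    print(f"{size} bytes per chunk = {size / 2**20:.1f} MiB")   # 33554432 bytes = 32.0 MiB
    print(f"{size // 256 // 1024} KiB per token")               # 128 KiB
```

Under these assumptions a 2,000-token history costs roughly 250 MiB of KV data, which is why spilling to CPU memory and disk matters for long contexts.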

KV Cache Retrieval Logic

from typing import List, Optional

import torch


def retrieve(
    self,
    token_ids: torch.Tensor,
    mask: Optional[torch.Tensor] = None,
    kvcaches: Optional[List[torch.Tensor]] = None,
    slot_mapping: Optional[torch.Tensor] = None,
    offset: int = 0,
) -> torch.Tensor:
    # Validate the input; by default every token is eligible for lookup.
    assert isinstance(token_ids, torch.Tensor)
    if mask is None:
        mask = torch.ones_like(token_ids, dtype=torch.bool)

    # Look up the KV cache chunk by chunk.
    ret_token_mask = torch.zeros_like(token_ids, dtype=torch.bool)
    for start, end, key in self.token_db.process_tokens(token_ids, mask):
        cached_data = self.engine.get(key)
        if cached_data is not None:
            # Copy the cached chunk into the GPU KV buffers.
            self.gpu_connector.to_gpu(cached_data, start, end, kvcaches, slot_mapping)
            ret_token_mask[start:end] = True

    # True marks the tokens whose KV cache was found and loaded.
    return ret_token_mask

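The retrieval loop relies on process_tokens yielding a (start, end, key) triple per chunk. One common way to build such keys is to hash each chunk together with the hash of everything before it, so a key identifies a chunk in the context of its prefix. The sketch below illustrates that idea with plain Python lists and a set as the store. It is an assumption about the general approach, not LMCache's actual token database, and it covers only prefix-style reuse; the non-prefix reuse mentioned earlier is not modeled here.

```python
import hashlib
from typing import Iterator, List, Set, Tuple


def process_tokens(
    token_ids: List[int], chunk_size: int
) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, key) per chunk; each key depends on the whole prefix."""
    prefix_hash = ""
    for start in range(0, len(token_ids), chunk_size):
        end = min(start + chunk_size, len(token_ids))
        chunk = ",".join(map(str, token_ids[start:end]))
        # Chain the previous hash so identical chunks after different
        # prefixes get different keys.
        prefix_hash = hashlib.sha256(f"{prefix_hash}|{chunk}".encode()).hexdigest()
        yield start, end, prefix_hash


def retrieve_mask(token_ids: List[int], store: Set[str], chunk_size: int) -> List[bool]:
    """Mark which tokens have a cached chunk in the (toy) store."""
    mask = [False] * len(token_ids)
    for start, end, key in process_tokens(token_ids, chunk_size):
        if key in store:
            mask[start:end] = [True] * (end - start)
    return mask


if __name__ == "__main__":
    # Pretend tokens 1..8 were served before and their chunk keys were stored.
    store = {key for _, _, key in process_tokens(list(range(1, 9)), chunk_size=4)}
    print(retrieve_mask([1, 2, 3, 4, 9, 9, 9, 9], store, chunk_size=4))
    print(retrieve_mask([9, 2, 3, 4, 5, 6, 7, 8], store, chunk_size=4))
```

The first query shares its first chunk with the stored sequence, so only those four tokens hit. The second query changes the first token, which changes every chained key, so nothing hits even though its second chunk has the same tokens.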

Blended Attention Computation

from typing import Optional

import torch


def process_qkv(
    self,
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    residual: torch.Tensor,
    layer_id: int,
    attn_output: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    # Only blend on the layers selected for checking; otherwise pass through.
    if layer_id not in self.common_metadata.check_layers:
        return attn_output

    # Fetch the matching K and V from the cache.
    retrieved_k, retrieved_v = self.cache_engine.retrieve_kv(q)

    # Blend freshly computed and cached KV, using the configured recompute ratio.
    blended_output = self.blender.blend(
        q, k, v,
        retrieved_k, retrieved_v,
        self.common_metadata.recomp_ratios[0],
    )

    # Residual connection.
    if residual is not None:
        blended_output += residual

    return blended_output

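The role of the recompute ratio can be illustrated with a toy version of selective recomputation: keep the cached per-token vectors, except for the fraction of tokens whose cached values deviate most from freshly computed ones. This is a simplified sketch, not the CacheBlend algorithm or LMCache's blender. The function blend_kv is hypothetical, vectors are plain Python lists, and fresh values are supplied for every token purely for illustration, whereas the point of the real technique is to avoid computing them for most tokens.

```python
import math
from typing import List

Vector = List[float]


def blend_kv(cached: List[Vector], fresh: List[Vector], recomp_ratio: float) -> List[Vector]:
    """Replace the most-deviating fraction of cached vectors with fresh ones."""
    if len(cached) != len(fresh):
        raise ValueError("cached and fresh must cover the same tokens")
    # Per-token Euclidean distance between cached and fresh vectors.
    deviation = [math.dist(c, f) for c, f in zip(cached, fresh)]
    num_recompute = math.ceil(recomp_ratio * len(cached))
    # Token indices sorted from largest to smallest deviation.
    ranked = sorted(range(len(cached)), key=lambda i: deviation[i], reverse=True)
    recompute = set(ranked[:num_recompute])
    return [fresh[i] if i in recompute else cached[i] for i in range(len(cached))]


if __name__ == "__main__":
    cached = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    fresh = [[0.0, 0.1], [5.0, 5.0], [2.0, 2.0], [3.0, 3.2]]
    # With a ratio of 0.25, only the single most-deviating token is replaced.
    print(blend_kv(cached, fresh, recomp_ratio=0.25))
```

A larger ratio trades more GPU compute for output closer to full recomputation; a ratio of 0 reuses the cache unchanged.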
