LLM Inference Optimization Techniques

2025-08-31

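Inference cost is the main bottleneck limiting large-scale deployment of large language models. This article surveys inference optimization techniques across memory management, computation, parallelism, and quantization.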

KV Cache

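In autoregressive generation, producing each new token requires attention over the Keys and Values of all preceding tokens. The KV Cache stores the K/V tensors that have already been computed, so each decoding step only computes K/V for the newest token instead of recomputing them for the whole prefix.

KV Cache memory footprint: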

$$\text{KV Cache Size} = 2 \times n_{layers} \times n_{heads} \times d_{head} \times \text{seq\_len} \times \text{batch\_size} \times \text{dtype\_size}$$

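For LLaMA-70B (80 layers, 64 heads, 128 dimensions per head), the KV Cache in FP16 takes about 2.5 MB per token when every head keeps its own K/V (standard multi-head attention), so a 4K-token sequence needs about 10 GB. The leading factor of 2 in the formula accounts for storing both K and V. Models that use GQA keep fewer K/V heads and need proportionally less.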
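
The arithmetic above can be checked with a short script. This is a minimal sketch: the helper name `kv_cache_bytes` is hypothetical, it assumes every attention head keeps its own K/V, and it reports binary units (MiB, GiB).

```python
def kv_cache_bytes(n_layers, n_heads, d_head, seq_len, batch_size=1, dtype_size=2):
    """Bytes needed to cache K and V for every layer (the leading 2 covers K and V)."""
    return 2 * n_layers * n_heads * d_head * seq_len * batch_size * dtype_size

MIB = 1024 ** 2
GIB = 1024 ** 3

# Configuration quoted in the text: 80 layers, 64 heads, 128 dims per head, FP16 (2 bytes)
per_token = kv_cache_bytes(n_layers=80, n_heads=64, d_head=128, seq_len=1)
full_4k = kv_cache_bytes(n_layers=80, n_heads=64, d_head=128, seq_len=4096)

print(f"per token:  {per_token / MIB:.2f} MiB")
print(f"4K context: {full_4k / GIB:.2f} GiB")
```

Because the size grows linearly with both `seq_len` and `batch_size`, long contexts and large batches quickly dominate GPU memory.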

# Simplified KV Cache implementation
import torch

class KVCache:
    def __init__(self):
        self.key_cache = []    # list of (batch, heads, seq_len, d_k)
        self.value_cache = []

    def update(self, layer_idx, new_key, new_value):
        if layer_idx >= len(self.key_cache):
            self.key_cache.append(new_key)
            self.value_cache.append(new_value)
        else:
            self.key_cache[layer_idx] = torch.cat(
                [self.key_cache[layer_idx], new_key], dim=2
            )
            self.value_cache[layer_idx] = torch.cat(
                [self.value_cache[layer_idx], new_value], dim=2
            )

    def get(self, layer_idx):
        return self.key_cache[layer_idx], self.value_cache[layer_idx]

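Reducing the KV Cache with GQA and MQA

Multi-Query Attention (MQA): all Query heads share a single set of Keys and Values, shrinking the KV Cache to $1/n_{heads}$ of its original size
Grouped Query Attention (GQA): Query heads are divided into groups, and each group shares one set of K/V, a middle ground between MHA and MQA. LLaMA 2 70B and Mistral use this scheme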
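
The relationship between MHA, GQA, and MQA can be sketched as a mapping from query heads to shared KV heads. The function names and the head counts below are illustrative assumptions, not taken from any specific model configuration.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the shared KV head that a given query head reads from."""
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_cache_ratio(n_q_heads, n_kv_heads):
    """KV Cache size relative to full multi-head attention (MHA)."""
    return n_kv_heads / n_q_heads

n_q_heads = 64
mha_ratio = kv_cache_ratio(n_q_heads, 64)  # one KV head per query head
gqa_ratio = kv_cache_ratio(n_q_heads, 8)   # hypothetical grouping: 8 query heads per KV head
mqa_ratio = kv_cache_ratio(n_q_heads, 1)   # a single KV head shared by all query heads

# With 16 query heads and 4 KV heads, each run of 4 query heads shares one KV head
mapping = [kv_head_for_query_head(h, 16, 4) for h in range(16)]
print(mha_ratio, gqa_ratio, mqa_ratio)
print(mapping)
```

MHA and MQA are the two extremes of the same scheme: `n_kv_heads == n_q_heads` and `n_kv_heads == 1`.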

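PagedAttention (vLLM)

A conventional KV Cache pre-allocates one contiguous memory region per request, typically sized for the maximum sequence length, which causes severe fragmentation and waste. Borrowing the paging idea from operating-system virtual memory, vLLM splits the KV Cache into fixed-size blocks and allocates them on demand.

Key advantages:

Memory utilization rises from roughly 20-40% to nearly 100%
KV Cache blocks can be shared across requests (for example, a common system prompt)
Throughput improves by 2-4x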
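
The paging idea can be illustrated with a toy block-table allocator. This is a sketch under stated assumptions, not vLLM's implementation: the class `PagedKVAllocator` and its methods are hypothetical, and it only tracks block bookkeeping (no tensors are stored or copied).

```python
class PagedKVAllocator:
    """Toy block table: each sequence maps to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}      # physical block id -> number of sequences using it
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.seq_lens = {}       # sequence id -> tokens stored so far

    def _allocate(self):
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def add_sequence(self, seq_id):
        self.block_tables[seq_id] = []
        self.seq_lens[seq_id] = 0

    def append_token(self, seq_id):
        """Reserve room for one token; a block is allocated only when needed."""
        table = self.block_tables[seq_id]
        if self.seq_lens[seq_id] % self.block_size == 0:
            table.append(self._allocate())          # last block is full (or none yet)
        elif self.ref_count[table[-1]] > 1:
            # Copy-on-write: the partially filled last block is shared, so take a private one
            self.ref_count[table[-1]] -= 1
            table[-1] = self._allocate()
        self.seq_lens[seq_id] += 1

    def fork(self, parent_id, child_id):
        """Share the parent's blocks with a child (e.g. a common system prompt)."""
        self.block_tables[child_id] = list(self.block_tables[parent_id])
        self.seq_lens[child_id] = self.seq_lens[parent_id]
        for block in self.block_tables[child_id]:
            self.ref_count[block] += 1

    def free_sequence(self, seq_id):
        for block in self.block_tables.pop(seq_id):
            self.ref_count[block] -= 1
            if self.ref_count[block] == 0:
                del self.ref_count[block]
                self.free_blocks.append(block)
        del self.seq_lens[seq_id]


alloc = PagedKVAllocator(num_blocks=8, block_size=4)
alloc.add_sequence("a")
for _ in range(6):                  # 6 tokens occupy ceil(6 / 4) = 2 blocks
    alloc.append_token("a")
alloc.fork("a", "b")                # "b" reuses both blocks, no new allocation
blocks_after_fork = 8 - len(alloc.free_blocks)
alloc.append_token("b")             # shared partial block triggers copy-on-write
blocks_after_write = 8 - len(alloc.free_blocks)
print(blocks_after_fork, blocks_after_write)
```

Waste is bounded by at most one partially filled block per sequence, which is where the utilization gain comes from.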

# vLLM usage example
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    tensor_parallel_size=2,           # tensor parallelism across 2 GPUs
    gpu_memory_utilization=0.9,       # fraction of GPU memory vLLM may use
    max_model_len=4096,
)

prompts = ["Explain quantum computing in simple terms.",
           "Write a Python function for binary search."]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Continuous Batching

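Traditional static batching waits for every request in a batch to finish before starting the next batch, so short requests sit idle behind the longest one. Continuous batching (also called iteration-level batching) schedules requests dynamically at the granularity of a single decoding step:

Finished requests are released immediately
New requests can join at any decoding step
GPU utilization improves significantly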
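
The scheduling difference can be sketched with a small step-counting simulation. It is a simplification under stated assumptions: each request is reduced to the number of decode steps it needs, prefill cost is ignored, and both function names are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the waiting queue at every decode step."""
    waiting = deque(lengths)
    running = []   # remaining decode steps for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        running = [r - 1 for r in running]        # one decode step for every active request
        running = [r for r in running if r > 0]   # finished requests leave immediately
        steps += 1
    return steps

lengths = [2, 8, 3, 8, 2, 2, 2, 2]   # decode steps needed per request
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

In this toy workload the short requests slip into slots freed by earlier short requests instead of waiting for a whole new batch, so the total step count drops to the length of the longest request.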

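Speculative Decoding

A small, fast "draft model" quickly generates several candidate tokens, and the large model then verifies them in parallel in a single forward pass. Accepted tokens are kept, and generation resumes from the first rejected position.

This can yield a 2-3x speedup without changing the output distribution; the actual gain depends on how often the draft tokens are accepted.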
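
The draft-then-verify loop can be sketched for the greedy case. This is an illustration under stated assumptions: `target_next` and `draft_next` are hypothetical toy functions standing in for real models, and the sampling case (which needs a rejection-sampling acceptance rule to preserve the distribution) is not covered.

```python
def speculative_greedy_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding with toy next-token functions."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens autoregressively (cheap)
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target scores all k + 1 positions; a real model does this in one forward pass
        target_passes += 1
        verified = [target_next(tokens + draft[:i]) for i in range(k + 1)]
        # 3. Accept draft tokens while they match the target's greedy choice
        n_accept = 0
        while n_accept < k and draft[n_accept] == verified[n_accept]:
            n_accept += 1
        # 4. Keep the accepted prefix plus one token chosen by the target itself
        tokens.extend(draft[:n_accept])
        tokens.append(verified[n_accept])
    return tokens[:len(prompt) + max_new_tokens], target_passes

def target_next(tokens):
    return (tokens[-1] * 3 + 1) % 10

def draft_next(tokens):
    # Agrees with the target except after token 3, to exercise the rejection path
    return 9 if tokens[-1] == 3 else target_next(tokens)

def plain_greedy(prompt, n):
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens

spec_tokens, passes = speculative_greedy_decode(target_next, draft_next, [1], max_new_tokens=12)
baseline = plain_greedy([1], 12)
print(spec_tokens == baseline, passes)
```

The output matches plain greedy decoding with the target, while the number of target passes is smaller than the number of generated tokens.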

# Speculative decoding (assisted generation) in Hugging Face
from transformers import AutoModelForCausalLM, AutoTokenizer

# Large model (verifier)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf")
# Small model (draft); here it shares the large model's tokenizer
assistant_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    max_new_tokens=100,
    do_sample=False,
)

FlashAttention

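FlashAttention optimizes the attention computation with an IO-aware tiling algorithm that reduces data movement between GPU HBM (high-bandwidth memory) and SRAM (on-chip memory).

Core ideas:

Load Q, K, and V into SRAM block by block
Compute attention entirely within SRAM
Use online softmax so that softmax can be computed block by block, without a pass over the full score row
Avoid writing the full attention matrix back to HBM

FlashAttention-2 builds on this with better parallelism and work partitioning.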
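
The online softmax step can be sketched in plain Python for a single query vector. This is a numerical illustration only, not the GPU kernel: the function names are hypothetical, the usual $1/\sqrt{d}$ scaling is omitted, and "blocks" are just list slices.

```python
import math

def naive_attention(q, keys, values):
    """Reference: softmax over all scores at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def blockwise_attention(q, keys, values, block_size=2):
    """Process K/V one block at a time, keeping only a running max, sum, and output."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])   # unnormalized output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier statistics to the new max before adding this block
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = naive_attention(q, keys, values)
blockwise = blockwise_attention(q, keys, values)
print(reference, blockwise)
```

The blockwise version never holds more than one block of scores at a time, which is what lets the real kernel keep its working set in SRAM.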

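# Natively supported in PyTorch 2.0+
import torch
import torch.nn.functional as F

# Example tensors shaped (batch, heads, seq_len, head_dim)
query = torch.randn(1, 8, 128, 64)
key = torch.randn(1, 8, 128, 64)
value = torch.randn(1, 8, 128, 64)

# scaled_dot_product_attention picks an available fused backend automatically.
# Pass either attn_mask or is_causal=True, not both.
output = F.scaled_dot_product_attention(
    query, key, value,
    is_causal=True,   # apply a causal attention mask
    dropout_p=0.0,
)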

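Parallelism Strategies

Tensor Parallelism

The weight matrices of a single layer are split along a chosen dimension across multiple GPUs; each GPU computes a partial result, and the results are combined with All-Reduce communication. Because this communication happens in every layer and needs a fast interconnect, it suits single-node multi-GPU setups and latency-sensitive workloads.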
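
The split-compute-reduce pattern can be sketched without any GPU. This is a minimal simulation under stated assumptions: matrices are plain nested lists, each loop iteration plays the role of one GPU rank, and the final sum stands in for All-Reduce. The function names are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def row_parallel_linear(x, w, world_size):
    """Shard W by rows (and X by columns), then sum the partial outputs."""
    k = len(w)
    assert k % world_size == 0, "inner dimension must divide evenly across ranks"
    shard = k // world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        x_shard = [row[lo:hi] for row in x]   # columns of X held by this rank
        w_shard = w[lo:hi]                    # rows of W held by this rank
        partials.append(matmul(x_shard, w_shard))
    # Simulated All-Reduce (sum): every rank ends with the elementwise sum of all partials
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0], [0, 1], [2, 1], [1, 2]]
full = matmul(x, w)
sharded = row_parallel_linear(x, w, world_size=2)
print(full, sharded)
```

Each rank stores only `1 / world_size` of the weights, at the cost of one collective per sharded matrix multiply.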

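Pipeline Parallelism

Different layers of the model are placed on different GPUs to form a pipeline. It suits multi-node parallelism, but it introduces pipeline bubbles, periods in which some stages sit idle.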
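
The size of the bubble can be estimated with a short calculation. This sketch assumes a simple schedule in which every stage takes the same time per micro-batch; with $p$ stages and $m$ micro-batches the idle fraction is $(p - 1) / (m + p - 1)$. The function name is hypothetical.

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a simple equal-stage pipeline: (p - 1) / (m + p - 1)."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

single = pipeline_bubble_fraction(num_stages=4, num_microbatches=1)
many = pipeline_bubble_fraction(num_stages=4, num_microbatches=13)
print(single, many)
```

With a single micro-batch only one of the four stages is busy at any time; splitting the work into more micro-batches shrinks the bubble.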

TensorRT-LLM

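NVIDIA's TensorRT-LLM integrates many of the optimizations above into an end-to-end LLM inference acceleration stack.

Key features:

Automatic operator fusion (layer fusion)
FP16 inference plus INT8/INT4/FP8 quantization
Built-in KV Cache management and paged attention
Tensor Parallelism and Pipeline Parallelism
In-flight batching (continuous batching)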

# TensorRT-LLM build-and-run example
# Script and flag names vary between TensorRT-LLM releases; check the version you install.
# 1. Convert the model checkpoint
# python convert_checkpoint.py --model_dir llama-2-7b --output_dir ./ckpt

# 2. Build the TRT engine
# trtllm-build --checkpoint_dir ./ckpt \
#     --output_dir ./engine \
#     --gemm_plugin float16 \
#     --max_batch_size 64 \
#     --max_input_len 2048 \
#     --max_output_len 512

# 3. Run from Python
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir("./engine")
# input_ids: tokenized prompts prepared beforehand (a list of 1-D token-id tensors)
outputs = runner.generate(
    batch_input_ids=input_ids,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
)

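Comparison of Optimization Techniques

Technique	Optimization target	Speedup	Accuracy impact
KV Cache	Avoids recomputation	Essential (baseline)	None
PagedAttention	Memory efficiency	2-4x throughput	None
Continuous Batching	GPU utilization	2-3x throughput	None
Speculative Decoding	Fewer decoding steps	2-3x lower latency	None
FlashAttention	Memory I/O	2-4x	None
INT8/INT4 quantization	Memory + compute	2-4x	Minor
Tensor parallelism	Latency	Up to near-linear in GPU count	None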

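Summary

LLM inference optimization is a systems engineering problem that requires coordinated work across several dimensions: memory management (KV Cache, PagedAttention), compute efficiency (FlashAttention, operator fusion), scheduling (Continuous Batching, Speculative Decoding), and hardware utilization (parallelism strategies, quantization). In real deployments, choose the combination of optimizations that fits your latency, throughput, and cost requirements.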

Contributors

withesse

Changelog

2026/3/14 13:09

9f6c2-feat: organize wiki content and refresh site setup on 2026/3/14