Memory Management (sketch below)
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- SGLang: Efficient Execution of Structured Language Model Programs
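
To make the PagedAttention idea concrete, here is a minimal Python sketch of block-based KV-cache allocation: the cache is split into fixed-size blocks, and each sequence maps logical token positions to physical blocks through a block table, so memory is allocated on demand and freed without fragmentation. `PagedKVCache`, `BLOCK_SIZE`, and the method names are illustrative, not vLLM's actual API.

```python
BLOCK_SIZE = 16  # tokens per KV block (assumed; vLLM uses a similar small size)

class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))    # pool of physical blocks
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks
        self.seq_lens: dict[int, int] = {}            # seq_id -> tokens written

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a cache slot for the next token; returns (block, offset)."""
        pos = self.seq_lens.get(seq_id, 0)
        if pos % BLOCK_SIZE == 0:  # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = pos + 1
        return self.block_tables[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool; no compaction needed."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4)
for _ in range(20):               # 20 tokens -> ceil(20/16) = 2 blocks
    cache.append_token(seq_id=0)
print(cache.block_tables[0])      # logically contiguous, physically scattered
cache.free(0)                     # both blocks return to the pool
```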
Speculative Decoding (sketch below)
- Fast Inference from Transformers via Speculative Decoding
- Accelerating Large Language Model Decoding with Speculative Sampling
- Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
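
The papers above share a draft-then-verify loop; below is a minimal sketch of its greedy special case (the rejection-sampling rule of Leviathan et al. reduces to prefix matching when both models decode greedily). `draft_model` and `target_model` are placeholder callables mapping a token list to the next greedy token, not any library's API.

```python
def speculative_step(prefix, draft_model, target_model, gamma=4):
    """Extend `prefix` (a list of token ids) by 1..gamma+1 tokens."""
    # 1. The cheap draft model autoregressively proposes gamma tokens.
    draft = []
    for _ in range(gamma):
        draft.append(draft_model(prefix + draft))
    # 2. One target-model pass scores all gamma+1 positions in parallel;
    #    here each call stands in for reading one position's greedy token.
    verified = [target_model(prefix + draft[:i]) for i in range(gamma + 1)]
    # 3. Accept the longest agreeing prefix, then take one free token
    #    from the target at the first mismatch (or after a full accept).
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    accepted.append(verified[len(accepted)])  # bonus/correction token
    return prefix + accepted

# Toy check: the target counts upward mod 10; the draft is wrong on some steps.
target = lambda seq: (seq[-1] + 1) % 10
drafter = lambda seq: (seq[-1] + 1) % 10 if len(seq) % 3 else (seq[-1] + 2) % 10
out = [5]
while len(out) < 12:
    out = speculative_step(out, drafter, target)
print(out)  # identical to what the target alone would have produced
```

Because verification always falls back to the target's own token, the output sequence is exactly what the target model would have generated on its own; the draft only changes how many tokens each target pass yields.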
Scheduling and Batching (sketch below)
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
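
A minimal sketch of the Sarathi-style idea: each model iteration has a fixed token budget, ongoing decodes are admitted first at one token each, and the leftover budget is filled with a chunk of a pending prefill, so long prompts ride along with the decode batch instead of stalling it. `TOKEN_BUDGET`, `build_batch`, and the request format are assumptions for illustration.

```python
from collections import deque

TOKEN_BUDGET = 512  # tokens processed per model iteration (assumed)

def build_batch(decode_seqs, prefill_queue):
    batch, budget = [], TOKEN_BUDGET
    # Decodes first: cheap (1 token each) and latency-critical.
    for seq_id in decode_seqs:
        if budget == 0:
            break
        batch.append((seq_id, "decode", 1))
        budget -= 1
    # Piggyback a chunk of the oldest pending prefill into the leftover budget.
    if budget > 0 and prefill_queue:
        req = prefill_queue[0]
        chunk = min(budget, req["remaining"])
        batch.append((req["id"], "prefill", chunk))
        req["remaining"] -= chunk
        if req["remaining"] == 0:
            prefill_queue.popleft()  # prompt fully prefilled; becomes a decode
    return batch

# Toy run: 3 ongoing decodes plus a 1200-token prompt split across iterations.
decodes = ["a", "b", "c"]
prefills = deque([{"id": "d", "remaining": 1200}])
while prefills:
    print(build_batch(decodes, prefills))
# -> prefill chunks of 509, 509, 182 tokens ride along with the decode batch
```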
Prefill/Decode (PD) Disaggregation (sketch below)
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- Splitwise: Efficient Generative LLM Inference Using Phase Splitting
- Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads (TetriInfer)
- MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
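
To illustrate the structure these systems share, here is a toy sketch of the prefill/decode split: one worker pool runs the compute-bound prompt pass and ships the KV cache to a separate pool that runs latency-bound decode, so the two phases no longer interfere. Every class and method name here is invented for illustration; real systems stream the cache over RDMA or a shared elastic pool (as in MemServe and Mooncake).

```python
EOS = 0  # illustrative end-of-sequence token id

class PrefillWorker:
    """Stand-in for a GPU pool that runs only the prompt (prefill) pass."""
    def prefill(self, prompt):
        kv_cache = list(prompt)    # pretend the prompt itself is the KV cache
        first_token = len(prompt)  # pretend first sampled token
        return kv_cache, first_token

class DecodeWorker:
    """Stand-in for a GPU pool that runs only token-by-token decode."""
    def __init__(self):
        self.sessions = {}

    def recv_kv(self, kv_cache):
        # The KV-cache handoff: the transfer cost these systems optimize.
        handle = len(self.sessions)
        self.sessions[handle] = kv_cache
        return handle

    def decode(self, handle, last_token):
        self.sessions[handle].append(last_token)
        return last_token - 1      # pretend next token; counts down to EOS

def serve_request(prompt, prefill_worker, decode_worker, max_new=16):
    kv_cache, first = prefill_worker.prefill(prompt)  # phase 1: prefill pool
    handle = decode_worker.recv_kv(kv_cache)          # KV cache handoff
    out = [first]
    while len(out) < max_new and out[-1] != EOS:      # phase 2: decode pool
        out.append(decode_worker.decode(handle, out[-1]))
    return out

print(serve_request([7, 7, 7, 7, 7], PrefillWorker(), DecodeWorker()))
# -> [5, 4, 3, 2, 1, 0]: tokens produced on the decode pool after the handoff
```

Because the two pools can be sized and scheduled independently, systems like DistServe and Splitwise tune each phase for its own target (throughput for prefill, per-token latency for decode).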
