Memory Management (sketch below)
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- SGLang: Efficient Execution of Structured Language Model Programs
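
To make the PagedAttention idea concrete, here is a minimal Python sketch of block-based KV-cache allocation: the cache is split into fixed-size blocks, and each sequence maps logical token positions to physical blocks through a block table, so memory is allocated on demand and freed without fragmentation. `PagedKVCache`, `BLOCK_SIZE`, and the method names are illustrative, not vLLM's actual API.

```python
BLOCK_SIZE = 16  # tokens per KV block (assumed; vLLM uses a similar small size)

class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))    # pool of physical blocks
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks
        self.seq_lens: dict[int, int] = {}            # seq_id -> tokens written

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a cache slot for the next token; returns (block, offset)."""
        pos = self.seq_lens.get(seq_id, 0)
        if pos % BLOCK_SIZE == 0:  # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = pos + 1
        return self.block_tables[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool; no compaction needed."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4)
for _ in range(20):               # 20 tokens -> ceil(20/16) = 2 blocks
    cache.append_token(seq_id=0)
print(cache.block_tables[0])      # logically contiguous, physically scattered
cache.free(0)                     # both blocks return to the pool
```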
Speculative Decoding (sketch below)
- Fast Inference from Transformers via Speculative Decoding
- Accelerating Large Language Model Decoding with Speculative Sampling
- Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
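
The papers above share a draft-then-verify loop; below is a minimal sketch of its greedy special case (the rejection-sampling rule of Leviathan et al. reduces to prefix matching when both models decode greedily). `draft_model` and `target_model` are placeholder callables mapping a token list to the next greedy token, not any library's API.

```python
def speculative_step(prefix, draft_model, target_model, gamma=4):
    """Extend `prefix` (a list of token ids) by 1..gamma+1 tokens."""
    # 1. The cheap draft model autoregressively proposes gamma tokens.
    draft = []
    for _ in range(gamma):
        draft.append(draft_model(prefix + draft))
    # 2. One target-model pass scores all gamma+1 positions in parallel;
    #    here each call stands in for reading one position's greedy token.
    verified = [target_model(prefix + draft[:i]) for i in range(gamma + 1)]
    # 3. Accept the longest agreeing prefix, then take one free token
    #    from the target at the first mismatch (or after a full accept).
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    accepted.append(verified[len(accepted)])  # bonus/correction token
    return prefix + accepted

# Toy check: the target counts upward mod 10; the draft is wrong on some steps.
target = lambda seq: (seq[-1] + 1) % 10
drafter = lambda seq: (seq[-1] + 1) % 10 if len(seq) % 3 else (seq[-1] + 2) % 10
out = [5]
while len(out) < 12:
    out = speculative_step(out, drafter, target)
print(out)  # identical to what the target alone would have produced
```

Because verification always falls back to the target's own token, the output sequence is exactly what the target model would have generated on its own; the draft only changes how many tokens each target pass yields.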
Scheduling and Batching (sketch below)
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
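
A minimal sketch of the Sarathi-style idea: each model iteration has a fixed token budget, ongoing decodes are admitted first at one token each, and the leftover budget is filled with a chunk of a pending prefill, so long prompts ride along with the decode batch instead of stalling it. `TOKEN_BUDGET`, `build_batch`, and the request format are assumptions for illustration.

```python
from collections import deque

TOKEN_BUDGET = 512  # tokens processed per model iteration (assumed)

def build_batch(decode_seqs, prefill_queue):
    batch, budget = [], TOKEN_BUDGET
    # Decodes first: cheap (1 token each) and latency-critical.
    for seq_id in decode_seqs:
        if budget == 0:
            break
        batch.append((seq_id, "decode", 1))
        budget -= 1
    # Piggyback a chunk of the oldest pending prefill into the leftover budget.
    if budget > 0 and prefill_queue:
        req = prefill_queue[0]
        chunk = min(budget, req["remaining"])
        batch.append((req["id"], "prefill", chunk))
        req["remaining"] -= chunk
        if req["remaining"] == 0:
            prefill_queue.popleft()  # prompt fully prefilled; becomes a decode
    return batch

# Toy run: 3 ongoing decodes plus a 1200-token prompt split across iterations.
decodes = ["a", "b", "c"]
prefills = deque([{"id": "d", "remaining": 1200}])
while prefills:
    print(build_batch(decodes, prefills))
# -> prefill chunks of 509, 509, 182 tokens ride along with the decode batch
```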
Prefill/Decode (PD) Disaggregation (sketch below)
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- Splitwise: Efficient Generative LLM Inference Using Phase Splitting
- Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads (TetriInfer)
- MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
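
To illustrate the structure these systems share, here is a toy sketch of the prefill/decode split: one worker pool runs the compute-bound prompt pass and ships the KV cache to a separate pool that runs latency-bound decode, so the two phases no longer interfere. Every class and method name here is invented for illustration; real systems stream the cache over RDMA or a shared elastic pool (as in MemServe and Mooncake).

```python
EOS = 0  # illustrative end-of-sequence token id

class PrefillWorker:
    """Stand-in for a GPU pool that runs only the prompt (prefill) pass."""
    def prefill(self, prompt):
        kv_cache = list(prompt)    # pretend the prompt itself is the KV cache
        first_token = len(prompt)  # pretend first sampled token
        return kv_cache, first_token

class DecodeWorker:
    """Stand-in for a GPU pool that runs only token-by-token decode."""
    def __init__(self):
        self.sessions = {}

    def recv_kv(self, kv_cache):
        # The KV-cache handoff: the transfer cost these systems optimize.
        handle = len(self.sessions)
        self.sessions[handle] = kv_cache
        return handle

    def decode(self, handle, last_token):
        self.sessions[handle].append(last_token)
        return last_token - 1      # pretend next token; counts down to EOS

def serve_request(prompt, prefill_worker, decode_worker, max_new=16):
    kv_cache, first = prefill_worker.prefill(prompt)  # phase 1: prefill pool
    handle = decode_worker.recv_kv(kv_cache)          # KV cache handoff
    out = [first]
    while len(out) < max_new and out[-1] != EOS:      # phase 2: decode pool
        out.append(decode_worker.decode(handle, out[-1]))
    return out

print(serve_request([7, 7, 7, 7, 7], PrefillWorker(), DecodeWorker()))
# -> [5, 4, 3, 2, 1, 0]: tokens produced on the decode pool after the handoff
```

Because the two pools can be sized and scheduled independently, systems like DistServe and Splitwise tune each phase for its own target (throughput for prefill, per-token latency for decode).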
