vLLM

vLLM Inference Engine

Definition

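vLLM is an open-source, high-throughput LLM serving engine. Its core technique, PagedAttention, stores the KV cache in fixed-size blocks (pages) that need not be contiguous in GPU memory, much as an operating system's virtual memory maps pages to physical frames. Because blocks are allocated on demand, external fragmentation disappears and internal fragmentation is limited to the last, partially filled block of each sequence. More concurrent requests therefore fit in the same GPU memory, which raises GPU utilization and throughput.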
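The paging idea can be illustrated with a toy allocator. This is a minimal sketch and not vLLM's implementation: the class `ToyPagedKVCache`, its methods, and the block size of 16 tokens are all hypothetical names and values chosen for illustration. It tracks only which physical block holds each token's KV entry, not the tensors themselves.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per block; illustrative value, not a vLLM constant


@dataclass
class ToyPagedKVCache:
    """Hypothetical toy allocator mapping logical KV blocks to physical ones."""
    num_blocks: int
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # seq_id -> physical block ids
    seq_lens: dict = field(default_factory=dict)      # seq_id -> tokens stored

    def __post_init__(self):
        # Every physical block starts out free.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:
            # The last block is full (or the sequence is new): take any free
            # block. It does not have to be adjacent to the previous one.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return self.block_tables[seq_id][n // BLOCK_SIZE], n % BLOCK_SIZE

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        """Allocated but unused slots: at most BLOCK_SIZE - 1 per sequence."""
        return sum(len(table) * BLOCK_SIZE - self.seq_lens[seq_id]
                   for seq_id, table in self.block_tables.items())


cache = ToyPagedKVCache(num_blocks=4)
for _ in range(17):
    cache.append_token("a")  # 17 tokens need two blocks
for _ in range(3):
    cache.append_token("b")  # 3 tokens need one block
```

In this sketch the waste per sequence is bounded by one partially filled block, whereas reserving one contiguous region per request at the maximum sequence length would leave most of that region unused for short outputs.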

vLLM supports continuous batching, tensor parallelism, and dozens of open-source model architectures.
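Continuous batching can be sketched as a scheduling loop that admits waiting requests as soon as a running one finishes, instead of waiting for the whole batch to drain. The function below is a simplified illustration under stated assumptions, not vLLM's scheduler: `continuous_batching` and its parameters are hypothetical, each step is assumed to generate exactly one token per running request, and every request is assumed to need at least one token.

```python
from collections import deque


def continuous_batching(requests, max_batch):
    """Simulate iteration-level scheduling.

    requests: list of (request_id, tokens_to_generate) in arrival order.
    Returns a dict mapping request_id to the step at which it finished.
    """
    waiting = deque(requests)
    running = {}      # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Fill any free batch slots before each step.
        while waiting and len(running) < max_batch:
            request_id, tokens = waiting.popleft()
            running[request_id] = tokens
        step += 1
        # One decoding step: every running request produces one token.
        for request_id in list(running):
            running[request_id] -= 1
            if running[request_id] <= 0:
                del running[request_id]
                finished_at[request_id] = step
    return finished_at


result = continuous_batching([("a", 2), ("b", 5), ("c", 1)], max_batch=2)
```

Here request "c" takes the slot freed by "a" and finishes at step 3, while "b" is still decoding. With static batching it would not start until "b" finished at step 5.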

Related Terms

Graphics Processing Unit
