vLLM Inference Engine

Definition

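vLLM is an open-source, high-throughput LLM serving engine. It uses PagedAttention to store each request's KV cache in fixed-size blocks (pages) that need not be contiguous in GPU memory, much as an operating system pages virtual memory. Because blocks are allocated on demand, external fragmentation is eliminated and internal waste is limited to the last, partially filled block of each sequence. The memory saved lets more concurrent requests fit in a batch, which raises GPU utilization and throughput.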
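
The paging idea can be illustrated with a toy block table. This is a minimal sketch of the concept only, not vLLM's implementation: `BlockAllocator`, `Sequence`, and the `BLOCK_SIZE` value are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class BlockAllocator:
    """Hands out fixed-size physical blocks from one shared pool."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


@dataclass
class Sequence:
    """Tracks one request's logical-to-physical block mapping."""

    block_table: list[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def slot(self, token_index: int) -> tuple[int, int]:
        # Translate a logical token position to (physical block, offset).
        return self.block_table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE


allocator = BlockAllocator(num_blocks=8)
a, b = Sequence(), Sequence()

# Interleave two requests so their blocks end up non-contiguous.
for _ in range(16):
    a.append_token(allocator)
for _ in range(5):
    b.append_token(allocator)
for _ in range(4):
    a.append_token(allocator)

print(a.block_table, b.block_table)
```

In this sketch, sequence `a` ends up with two physical blocks that are not adjacent, because `b` claimed the block between them. Neither sequence reserves memory for tokens it has not produced yet, so unused space is confined to the final block of each sequence.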

vLLM supports continuous batching, tensor parallelism, and dozens of open-source model architectures.
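
Continuous batching means the scheduler admits new requests as soon as running ones finish, instead of waiting for a whole batch to drain. The simulation below is a simplified sketch of that scheduling policy, not vLLM's scheduler; the function name `continuous_batching` and its parameters are hypothetical, and it assumes every request needs at least one token.

```python
from collections import deque


def continuous_batching(requests: dict[str, int], max_batch: int) -> list[list[str]]:
    """Simulate decode steps; `requests` maps a request id to tokens left to generate."""
    waiting = deque(requests.items())
    running: dict[str, int] = {}
    steps: list[list[str]] = []
    while waiting or running:
        # Admit waiting requests as soon as a batch slot is free.
        while waiting and len(running) < max_batch:
            rid, remaining = waiting.popleft()
            running[rid] = remaining
        steps.append(list(running))
        # Each running request generates one token per step.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # finished requests leave immediately
    return steps


steps = continuous_batching({"a": 1, "b": 3, "c": 2}, max_batch=2)
for i, batch in enumerate(steps, start=1):
    print(f"step {i}: {batch}")
```

With these inputs the simulation finishes in three steps, because request `c` takes the slot freed by `a` after the first step. A static batch of `a` and `b` would hold that slot idle until `b` finished, then run `c` alone, for five steps in total.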

