TensorRT

Definition

TensorRT is NVIDIA's SDK for high-performance deep learning inference, optimizing models for NVIDIA GPUs through layer fusion, precision calibration (INT8/FP16), and kernel auto-tuning. TensorRT-LLM extends these optimizations specifically for large language models with features like in-flight batching and paged KV caching.

It enables significant throughput gains over standard PyTorch inference.