Glossary Q
2 terms starting with Q
QLoRA combines 4-bit quantization of the base model with LoRA fine-tuning, making it possible to fine-tune very large models (65B+ parameters) on a single GPU. The base model is loaded in NF4 (4-bit NormalFloat) format and kept frozen, while the LoRA adapters are trained in 16-bit precision. The QLoRA work demonstrated that high-quality fine-tuning of a 65B model is feasible on a single 48GB GPU.
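A rough back-of-the-envelope sketch of why 4-bit storage matters for the 65B case. It counts only the frozen base weights and ignores quantization constants, adapter weights, optimizer state, and activations, so real usage is higher. The helper name `weight_memory_gb` is hypothetical and is not part of any QLoRA library.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


params_65b = 65e9  # parameter count of a 65B model

# Same model, two storage formats for the frozen base weights.
fp16_gb = weight_memory_gb(params_65b, 16)
nf4_gb = weight_memory_gb(params_65b, 4)

print(f"FP16 weights: {fp16_gb:.1f} GB")
print(f"4-bit weights: {nf4_gb:.1f} GB")
```

Under these simplifying assumptions the weights alone take about 130 GB in FP16 and about 32.5 GB at 4 bits, which is why only the 4-bit variant leaves headroom on a 48GB GPU.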
Quantization is a model compression technique that reduces the numerical precision of weights and/or activations from floating-point (FP32/FP16) to lower-bit integer formats (INT8, INT4). This reduces the memory footprint and can increase inference throughput and lower power consumption, usually at the cost of a small, often acceptable, loss in accuracy. Methods include post-training quantization (for example GPTQ and AWQ) and quantization-aware training.
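As a minimal illustration of the idea, the sketch below applies symmetric "absmax" INT8 quantization to a short list of weights using one shared scale. This is a deliberately simple scheme chosen for illustration; it is not GPTQ or AWQ, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to INT8 codes using one shared scale (symmetric absmax)."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the symmetric INT8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.03, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

print(codes)
print(restored)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the accuracy tradeoff paid for storing one byte per weight instead of two or four.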