Glossary Q

2 terms starting with Q

Quantized Low-Rank Adaptation

QLoRA combines 4-bit quantization of the base model with LoRA fine-tuning, which makes it possible to fine-tune very large models (up to 65B parameters) on a single GPU. The base model is loaded in NF4 (4-bit NormalFloat) format and kept frozen, while the LoRA adapters are trained in 16-bit precision. QLoRA demonstrated that high-quality fine-tuning of a 65B-parameter model is feasible on a single 48GB GPU.
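The memory argument behind this can be checked with simple arithmetic. The sketch below is only a rough estimate: it counts weight storage alone and ignores quantization constants, activations, and optimizer state. The helper names and the layer dimensions (an 8192 x 8192 matrix with rank 64) are hypothetical values chosen for illustration, not figures from the QLoRA work.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two low-rank factors: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


base_params = 65e9  # 65B-parameter base model

fp16_gb = weight_memory_gb(base_params, 16)  # frozen weights kept in 16-bit
nf4_gb = weight_memory_gb(base_params, 4)    # frozen weights stored in 4-bit

# Hypothetical layer shape and rank, for illustration only.
adapter_params = lora_param_count(d_in=8192, d_out=8192, rank=64)
full_params = 8192 * 8192

print(f"16-bit base weights: {fp16_gb:.1f} GB")
print(f"4-bit base weights:  {nf4_gb:.1f} GB")
print(f"LoRA adapter trains {adapter_params} of {full_params} parameters in this layer")
```

Under these assumptions the 4-bit weights take a quarter of the 16-bit footprint, which is what brings a 65B model within reach of a 48GB card, while the adapters add only a small fraction of each layer's parameters.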

LoRA, Quantization, PEFT, Fine-tuning, INT4

Quantization (AI)

Model Quantization

Quantization is a model compression technique that reduces the numerical precision of weights and/or activations from floating-point (FP32/FP16) to lower-bit integer formats (INT8, INT4). This reduces memory footprint, increases inference throughput, and lowers power consumption, usually at the cost of a small loss in accuracy. Methods fall into two groups: post-training quantization (for example GPTQ and AWQ) and quantization-aware training.
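To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization using the largest absolute value as the scale reference. This is the simplest possible scheme, shown for illustration only. It is not GPTQ or AWQ, which choose the quantized values more carefully, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Avoid division by zero for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.4, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is now stored in 8 bits instead of 16 or 32, plus one floating-point scale for the whole tensor. The reconstruction error stays within half of one quantization step, which is the accuracy tradeoff the definition refers to.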

INT8, INT4, GPTQ, AWQ, Distillation
