Distributed training

Distributed Model Training

Definition

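Distributed training splits the work of training a large neural network across multiple GPUs or machines. The main strategies are data parallelism (each device holds a full copy of the model and processes a different slice of each batch), model parallelism (the model's weights are split across devices, so no single device holds the whole network), and pipeline parallelism (consecutive groups of layers are placed on different devices and micro-batches flow through them in stages).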
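To make data parallelism concrete, here is a minimal sketch in plain Python. It simulates workers in a single process rather than on real GPUs, and the one-weight model, the function names, and the toy batch are all assumptions made for illustration. Each simulated worker computes a gradient on its own shard, and an averaging step stands in for the all-reduce that a real framework would perform over the network.

```python
def local_gradient(w, shard):
    """Per-worker step: sum of d/dw (w*x - y)^2 over this worker's shard."""
    grad_sum = sum(2.0 * (w * x - y) * x for x, y in shard)
    return grad_sum, len(shard)


def all_reduce_mean(partials):
    """Stand-in for an all-reduce: merge per-worker sums into one mean gradient."""
    total = sum(g for g, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


def data_parallel_step(w, batch, num_workers, lr=0.01):
    """One SGD step with the batch split across simulated workers."""
    # Hypothetical sharding scheme: worker i takes every num_workers-th example.
    shards = [batch[i::num_workers] for i in range(num_workers)]
    partials = [local_gradient(w, shard) for shard in shards]
    grad = all_reduce_mean(partials)
    # Every replica applies the same averaged gradient, so the copies stay in sync.
    return w - lr * grad


# Toy data for y = 2x (illustrative only).
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_single = data_parallel_step(0.0, batch, num_workers=1)
w_multi = data_parallel_step(0.0, batch, num_workers=2)
print(w_single, w_multi)
```

Both calls land on the same updated weight, which is the point of gradient synchronization: splitting the batch changes where the work happens, not the result of the step.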

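Frameworks such as PyTorch FSDP, DeepSpeed, and Megatron-LM implement these strategies and combine them with gradient synchronization, mixed-precision arithmetic, and memory optimizations such as sharding parameters, gradients, and optimizer state across devices. Together, these techniques make it practical to train models with hundreds of billions of parameters.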
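A rough back-of-the-envelope calculation shows why memory optimization matters at this scale. The sketch below assumes mixed-precision training with the Adam optimizer and the commonly cited figure of about 16 bytes of model state per parameter; it ignores activations and temporary buffers, and it assumes the state is sharded evenly across devices. The function name and the device count are illustrative, not taken from any framework.

```python
# Assumed per-parameter cost for mixed-precision Adam:
# 2 bytes fp16 weights + 2 bytes fp16 gradients
# + 12 bytes fp32 optimizer state (master weights, momentum, variance).
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gb(num_params, num_devices=1):
    """Approximate model-state memory per device in GB, assuming even sharding."""
    return num_params * BYTES_PER_PARAM / num_devices / 1e9


unsharded = model_state_gb(100e9)                # one device holds everything
sharded = model_state_gb(100e9, num_devices=64)  # state split across 64 devices
print(f"unsharded: {unsharded:.0f} GB, sharded over 64 devices: {sharded:.0f} GB")
```

Under these assumptions, a 100-billion-parameter model needs roughly 1,600 GB of model state, far more than any single GPU offers, while even sharding across 64 devices brings that to about 25 GB per device.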

Related Terms

GPU

Graphics Processing Unit
