Large language model (LLM) training relies on extensive data parallelism, which incurs massive gradient communication overhead. Gradient quantization is a promising remedy, but it faces two critical challenges: keeping training stable for transformer architectures and fitting into modern AllReduce-based distributed communication systems. In this paper, we propose BitDP, an ultra-low-bit gradient quantization and data parallelism system that reduces communication costs by up to 32× while preserving model accuracy, with less than 1% performance degradation. Our approach ensures numerical stability for large transformer models and integrates seamlessly with existing AllReduce infrastructures. We validate BitDP across various LLM sizes and architectural variants, achieving significant training efficiency improvements while maintaining convergence quality. These results establish BitDP as a scalable and reliable solution for real-world LLM training at industrial scale.
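The abstract does not describe BitDP's quantizer, so the following is only a rough illustration of where a figure like 32× could come from. It assumes (this is not stated in the source) that FP32 gradients are reduced to one sign bit per element plus a single shared FP32 scale. The function names `quantize_1bit`, `dequantize_1bit`, and `compression_ratio` are hypothetical, and the sketch ignores both the stability techniques and the AllReduce integration that the abstract mentions.

```python
# Illustrative 1-bit sign quantization; NOT BitDP's actual method.
# Assumption: FP32 gradients, one sign bit per element, one shared FP32 scale.

def quantize_1bit(grad):
    """Pack the sign of each gradient element into one bit; return (bytes, scale, n)."""
    n = len(grad)
    # Mean absolute value serves as the shared magnitude for every element.
    scale = sum(abs(g) for g in grad) / n if n else 0.0
    packed = bytearray((n + 7) // 8)
    for i, g in enumerate(grad):
        if g >= 0:
            packed[i // 8] |= 1 << (i % 8)  # bit set means non-negative
    return bytes(packed), scale, n

def dequantize_1bit(packed, scale, n):
    """Rebuild an approximate gradient: +scale where the bit is set, else -scale."""
    return [scale if (packed[i // 8] >> (i % 8)) & 1 else -scale for i in range(n)]

def compression_ratio(n, scale_bytes=4):
    """Ratio of FP32 payload size to 1-bit payload size (including the scale)."""
    fp32_bytes = 4 * n
    quant_bytes = (n + 7) // 8 + scale_bytes
    return fp32_bytes / quant_bytes

if __name__ == "__main__":
    packed, scale, n = quantize_1bit([0.5, -1.5, 2.0, -1.0])
    print(dequantize_1bit(packed, scale, n))
    print(compression_ratio(1 << 20))
```

Under these assumptions the ratio approaches, but stays just below, 32 for large tensors, because the shared scale adds a few bytes of overhead. Summing such payloads across workers is a separate problem: adding 1-bit values needs more than one bit per element, which is one reason AllReduce compatibility is nontrivial.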