Scalability And Resilience In Distributed LLM Training: A Survey Of Modern Techniques

Authors

  • Shreya Gupta

DOI:

https://doi.org/10.63278/jicrcr.vi.3189

Abstract

This article examines the evolving landscape of distributed training systems for Large Language Models (LLMs), addressing the pivotal challenges of scalability and resilience. The growing size and complexity of modern LLMs expose fundamental constraints in traditional training approaches: memory limitations, computational demands, and communication bottlenecks. The article puts forward a systematic taxonomy built on four interconnected pillars: parallelism strategies (data, tensor, pipeline, and sequence), memory optimization methods, fault tolerance mechanisms, and cluster management frameworks. The analysis shows how these elements interact to enable effective training on volatile, preemptible cloud resources rather than dedicated supercomputing hardware. By emphasizing the crucial interplay between these components, the article demonstrates how parallelism choices directly affect memory usage, communication patterns, and resilience capabilities. Through a detailed examination of systems engineered specifically for elastic training environments, the work highlights innovations in checkpointing, recovery protocols, and dynamic reconfiguration. The conclusion identifies promising research directions, including algorithm-system co-design for elasticity, automated parallelism strategy selection, standardized resilience benchmarking, and energy-conscious training methodologies.
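To make the checkpointing and recovery theme concrete, the following is a minimal sketch of a preemption-tolerant training loop in PyTorch-style Python. It is an illustration only, not an API from any surveyed system: the names CKPT_PATH and SAVE_EVERY, the checkpoint interval, and the toy linear model are all hypothetical placeholders.

# Minimal sketch of preemption-tolerant training: persist model, optimizer,
# and step state periodically so a restarted worker resumes where it left off.
# CKPT_PATH, SAVE_EVERY, and the toy model are illustrative assumptions.
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"   # hypothetical path; real systems shard checkpoints across workers
SAVE_EVERY = 100              # checkpoint interval in steps (assumption)

model = nn.Linear(512, 512)   # stand-in for an LLM
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
start_step = 0

# Recovery: if a previous run was preempted, restore its last saved state.
if os.path.exists(CKPT_PATH):
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, 1000):
    x = torch.randn(32, 512)            # synthetic batch
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Checkpointing: persist state so preemption loses at most SAVE_EVERY
    # steps of work; write to a temp file first, then rename atomically
    # so a crash mid-save never leaves a torn checkpoint behind.
    if step % SAVE_EVERY == 0:
        tmp = CKPT_PATH + ".tmp"
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, tmp)
        os.replace(tmp, CKPT_PATH)

The elastic systems the survey covers extend this basic idea with sharded, asynchronous checkpoints and coordinated restarts across workers, so that recovery cost stays bounded as cluster membership changes.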

Published

2025-08-19

How to Cite

Gupta, S. (2025). Scalability And Resilience In Distributed LLM Training: A Survey Of Modern Techniques. Journal of International Crisis and Risk Communication Research, 31–48. https://doi.org/10.63278/jicrcr.vi.3189

Section

Articles