Training large language models at scale suffers from costly instabilities. We introduce the R-Metric, a proactive reliability metric that predicts failures before they occur by combining signals from hardware monitoring, training dynamics, and model performance. It achieves F1 scores of 0.973 to 1.00 with a 12-minute lead time. At 1.8% overhead, the approach is lightweight enough to bring enterprise-grade reliability monitoring to resource-constrained organizations.