Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
MeCeFO: Enhancing LLM Training Robustness via Fault-Tolerant Optimization
Authors: Rizhen Hu, Yutong He, Ran Yan, Mou Sun, Binhang Yuan, Kun Yuan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, MeCeFO maintains robust performance under high failure rates, incurring only a 4.18% drop in throughput, demonstrating 5.0× to 6.7× greater resilience than previous SOTA approaches. Codes are available at https://github.com/pkumelon/MeCeFO. |
| Researcher Affiliation | Academia | Rizhen Hu (Peking University), Yutong He (Peking University), Ran Yan (HKUST), Mou Sun (Zhejiang Lab), Binhang Yuan (HKUST), Kun Yuan (Peking University) |
| Pseudocode | Yes | Algorithm 1: MeCeFO Algorithm; Algorithm 2: MeCeFO Forward Pass; Algorithm 3: MeCeFO Backward Pass |
| Open Source Code | Yes | Codes are available at https://github.com/pkumelon/MeCeFO. |
| Open Datasets | Yes | We pre-train LLaMA [51] models of various sizes on the C4 [44] dataset |
| Dataset Splits | No | The paper mentions pre-training models on the C4 dataset and evaluates 'Validation Perplexities' (Table 3), implying a validation split. However, it does not provide specific details on how the C4 dataset was split into training, validation, and test sets (e.g., percentages, sample counts, or a citation to a standard split methodology). |
| Hardware Specification | Yes | We conducted experiments on a 32-GPU cluster composed of four nodes, each with eight NVIDIA A100 GPUs. Intra-node communication leveraged NVLink (600 GB/s), and inter-node communication used InfiniBand (200 GB/s). |
| Software Dependencies | No | We implement MeCeFO on top of the HexiScale framework [59], which itself builds upon Megatron-LM [38]. Across all scenarios, we use the AdamW optimizer with β1 = 0.9, β2 = 0.999, weight_decay = 0.01, and ϵ = 1 × 10⁻⁸. |
| Experiment Setup | Yes | Across all scenarios, we use the AdamW optimizer with β1 = 0.9, β2 = 0.999, weight_decay = 0.01, and ϵ = 1 × 10⁻⁸. A learning rate warmup is applied over the first 10% of training iterations, followed by a cosine annealing schedule that decays the learning rate to 10% of its initial value. For MeCeFO, the SVD frequency is set to τ = 100. The number of training steps, batch sizes, and initial learning rates are listed in Table 11 and are tuned exclusively for optimizing baseline fault-free training performance. |
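
The learning rate schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `lr_at_step` and the dictionary `ADAMW_HYPERPARAMS` are hypothetical, the warmup is assumed to be linear (the excerpt only says "warmup"), and the step count and initial learning rate are placeholders, since the actual values are in Table 11 of the paper and are not reproduced here.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_frac=0.1, final_frac=0.1):
    """Learning rate at a 0-indexed step: warmup, then cosine annealing.

    warmup_frac=0.1 and final_frac=0.1 mirror the quoted setup (warmup over
    the first 10% of iterations, decay to 10% of the initial value).
    """
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        # Assumption: linear ramp that reaches base_lr at the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Cosine annealing from base_lr down to final_frac * base_lr.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    min_lr = final_frac * base_lr
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# AdamW hyperparameters as quoted in the table. The key names follow the
# keyword arguments of torch.optim.AdamW, but no framework is imported here.
ADAMW_HYPERPARAMS = {"betas": (0.9, 0.999), "eps": 1e-8, "weight_decay": 0.01}

if __name__ == "__main__":
    # Placeholder values, not taken from the paper.
    total_steps, base_lr = 1000, 1e-3
    for s in (0, 99, 100, 550, 1000):
        print(s, lr_at_step(s, total_steps, base_lr))
```

With these placeholder values, the rate climbs to `base_lr` by the end of warmup and then decays along a half cosine toward `0.1 * base_lr`.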