Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MeCeFO: Enhancing LLM Training Robustness via Fault-Tolerant Optimization

Authors: Rizhen Hu, Yutong He, Ran Yan, Mou Sun, Binhang Yuan, Kun Yuan

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, MeCeFO maintains robust performance under high failure rates, incurring only a 4.18% drop in throughput, demonstrating 5.0× to 6.7× greater resilience than previous SOTA approaches. Codes are available at https://github.com/pkumelon/MeCeFO.
Researcher Affiliation	Academia	Rizhen Hu Peking University EMAIL Yutong He Peking University EMAIL Ran Yan HKUST EMAIL Mou Sun Zhejiang Lab EMAIL Binhang Yuan HKUST EMAIL Kun Yuan Peking University EMAIL
Pseudocode	Yes	Algorithm 1: MeCeFO Algorithm; Algorithm 2: MeCeFO Forward Pass; Algorithm 3: MeCeFO Backward Pass
Open Source Code	Yes	Codes are available at https://github.com/pkumelon/MeCeFO.
Open Datasets	Yes	We pre-train LLaMA [51] models of various sizes on the C4 [44] dataset
Dataset Splits	No	The paper mentions pre-training models on the C4 dataset and evaluates 'Validation Perplexities' (Table 3), implying a validation split. However, it does not provide specific details on how the C4 dataset was split into training, validation, and test sets (e.g., percentages, sample counts, or a citation to a standard split methodology).
Hardware Specification	Yes	We conducted experiments on a 32-GPU cluster composed of four nodes, each with eight NVIDIA A100 GPUs. Intra-node communication leveraged NVLink (600 GB/s), and inter-node communication used InfiniBand (200 GB/s).
Software Dependencies	No	We implement MeCeFO on top of the HexiScale framework [59], which itself builds upon Megatron-LM [38]. Across all scenarios, we use the AdamW optimizer with β1 = 0.9, β2 = 0.999, weight_decay = 0.01, and ϵ = 1 × 10⁻⁸.
Experiment Setup	Yes	Across all scenarios, we use the AdamW optimizer with β1 = 0.9, β2 = 0.999, weight_decay = 0.01, and ϵ = 1 × 10⁻⁸. A learning rate warmup is applied over the first 10% of training iterations, followed by a cosine annealing schedule that decays the learning rate to 10% of its initial value. For MeCeFO, the SVD frequency is set to τ = 100. The number of training steps, batch sizes, and initial learning rates are listed in Table 11 and are tuned exclusively for optimizing baseline fault-free training performance.
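
The learning rate schedule quoted in the Experiment Setup row can be sketched in plain Python. This is only an illustration of the description above, not the authors' code: the function name `lr_at_step`, the linear shape of the warmup, and the example values for `total_steps` and `base_lr` are assumptions (the actual step counts and initial learning rates are in the paper's Table 11, which is not reproduced here). The dictionary `ADAMW_KWARGS` collects the quoted optimizer settings under the keyword names used by PyTorch's `torch.optim.AdamW`.

```python
import math

# Optimizer settings quoted in the table. The keys follow the keyword
# names of torch.optim.AdamW; the dictionary itself is illustrative.
ADAMW_KWARGS = {"betas": (0.9, 0.999), "weight_decay": 0.01, "eps": 1e-8}


def lr_at_step(step, total_steps, base_lr, warmup_frac=0.10, final_frac=0.10):
    """Learning rate at a 0-indexed training step.

    Warmup covers the first `warmup_frac` of training, then a cosine
    curve decays the rate to `final_frac * base_lr`.
    """
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Assumption: the warmup is linear (the source does not say).
        return base_lr * (step + 1) / warmup_steps
    # Cosine annealing from base_lr down to final_frac * base_lr.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = final_frac * base_lr
    return min_lr + (base_lr - min_lr) * cosine


if __name__ == "__main__":
    # Placeholder values, not taken from the paper.
    total_steps, base_lr = 1000, 1e-3
    for step in (0, 99, 100, 550, 1000):
        print(step, lr_at_step(step, total_steps, base_lr))
```

With these placeholder values the rate rises to `base_lr` by the end of warmup (step 99), stays at `base_lr` at the start of the decay phase, and reaches `0.1 * base_lr` at the final step.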