On Efficient Constructions of Checkpoints
Authors: Yu Chen, Zhenming Liu, Bin Ren, Xin Jin
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments show that LC-Checkpoint achieves a compression rate of up to 28× and a recovery speedup of up to 5.77× over a state-of-the-art algorithm (SCAR). This section evaluates LC-Checkpoint on four typical ML applications with three benchmark datasets, and compares it with previous efforts (SCAR (Qiao et al., 2018b) and a TOPN mechanism, as mentioned in Section 2) on recovery (rework) cost, compression ratio, and execution overhead, demonstrating the superiority of LC-Checkpoint. |
| Researcher Affiliation | Academia | Yu Chen 1 Zhenming Liu 1 Bin Ren 1 Xin Jin 2 1William & Mary, Williamsburg, Virginia, USA 2Johns Hopkins University, Baltimore, Maryland, USA. |
| Pseudocode | Yes | Algorithm 1 LC-CHECKPOINT-BASED SGD. Input: u*, u_0, η. 1: Initialize û_0 = u_0. 2: for t = 1 to T do. 3: Update model state: u_t = u* + η(u_{t-1} - u*) + ϵ. 4: Compute distance: δ_t = u_t - û_{t-1}. 5: Quantize δ_t: δ̂_t = QUANTIZE(δ_t). 6: Compress δ̂_t by Huffman coding and save to disk. 7: Update checkpoint state: û_t = û_{t-1} + δ̂_t. 8: end for. Output: û_T, { δ̂_t | t ∈ [T] }. (A minimal Python sketch of this loop appears after the table.) |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | ML Applications and Datasets: LC-Checkpoint is evaluated on four typical ML applications: Multinomial Logistic Regression (MLR), LeNet-5 (LeNet) (LeCun et al., 1998), AlexNet (Krizhevsky et al., 2012) and Matrix Factorization (MF). The first three applications are trained on the MNIST (LeCun et al., 1998) and Fashion-MNIST (Xiao et al., 2017) datasets. The last one, MF, is trained on Jester (Goldberg et al., 2001) and MovieLens 10M (Harper & Konstan, 2015). |
| Dataset Splits | No | The paper does not explicitly provide details about training/validation/test dataset splits. It mentions 'Two checkpoint sizes are tested: 5% and 10% of the full checkpoint size' and 'repeating each trial 50 times' for evaluating rework cost, but this is about checkpoint size/evaluation methodology, not data splits. |
| Hardware Specification | Yes | Our experiments are conducted on a multi-core server with an Intel Xeon Gold 6138 Skylake CPU with 40 cores, each running at 2.0 GHz, and 192 GB DDR4 memory. The training is performed on a Tesla P100 GPU with 16GB High-bandwidth Memory (HBM). |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) required to replicate the experiment. |
| Experiment Setup | Yes | To address this issue, LC-Checkpoint employs 2-bit and 3-bit priority promotion to control its checkpoint size at 5% and 10% of the full checkpoint size. The default approach creates a full checkpoint every 10 iterations. A failure is triggered at the 7th iteration. (See the recovery sketch after the table.) |
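
Algorithm 1 in the pseudocode row compresses each iteration's update delta by quantizing it into a handful of buckets and Huffman-coding the bucket indices. Below is a minimal, self-contained Python sketch of one such checkpoint step. It is not the paper's implementation: a toy sign/magnitude-quantile quantizer stands in for the paper's exponent-based priority-promotion quantizer, and the helper names (`quantize`, `huffman_encode`, `lc_checkpoint_step`) are ours.

```python
# Minimal sketch of an LC-Checkpoint-style delta checkpoint step (Algorithm 1 above).
# Assumptions (not from the paper): a toy 2-bit sign/magnitude quantizer replaces the
# paper's priority-promotion quantizer; the Huffman coder is a generic heapq-based one.
import heapq
from collections import Counter

import numpy as np


def quantize(delta, n_bits=2):
    """Map each delta entry to one of 2**n_bits buckets and return (codes, centers)."""
    n_buckets = 2 ** n_bits
    # Bucket by magnitude quantile and sign -- a stand-in for exponent-based bucketing.
    edges = np.quantile(np.abs(delta), np.linspace(0.0, 1.0, n_buckets // 2 + 1))
    mag_bucket = np.clip(np.searchsorted(edges, np.abs(delta)) - 1, 0, n_buckets // 2 - 1)
    codes = mag_bucket * 2 + (delta < 0)                      # interleave the sign bit
    centers = np.array([delta[codes == c].mean() if np.any(codes == c) else 0.0
                        for c in range(n_buckets)])           # one representative per bucket
    return codes.astype(np.uint8), centers


def huffman_encode(codes):
    """Return a bitstring and code table for the bucket indices."""
    freq = Counter(codes.tolist())
    heap = [[w, i, {sym: ""}] for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                                        # degenerate single-symbol case
        heap[0][2] = {next(iter(freq)): "0"}
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        merged = {s: "0" + b for s, b in lo[2].items()}
        merged.update({s: "1" + b for s, b in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], merged])
    table = heap[0][2]
    return "".join(table[c] for c in codes.tolist()), table


def lc_checkpoint_step(u_t, u_hat_prev, n_bits=2):
    """One checkpoint step: quantize the delta, Huffman-code it, update the checkpoint state."""
    delta = u_t - u_hat_prev                                  # line 4 of Algorithm 1
    codes, centers = quantize(delta, n_bits)                  # line 5
    bits, table = huffman_encode(codes)                       # line 6 (write `bits` to disk)
    u_hat = u_hat_prev + centers[codes]                       # line 7
    return u_hat, bits, table, centers
```

In this sketch, the bitstring `bits` together with `table` and `centers` is what would be written to disk each iteration; comparing `len(bits)` with `delta.nbytes * 8` gives a rough per-step compression rate.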
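The experiment-setup row contrasts the default scheme (a full checkpoint every 10 iterations, so a failure at the 7th iteration forces 7 iterations of rework) with LC-Checkpoint, which can rebuild its checkpoint state by replaying the saved per-iteration deltas. A hypothetical sketch of that comparison follows, reusing the (codes, centers) pairs from the block above as a simplified in-memory stand-in for the Huffman-decoded deltas; the function names and the `saved` format are our own.

```python
# Hypothetical illustration of the rework-cost comparison in the experiment-setup row.
import numpy as np


def recover_from_deltas(u0, saved):
    """Rebuild the checkpoint state û_t by accumulating dequantized deltas (Algorithm 1, line 7)."""
    u_hat = u0.copy()
    for codes, centers in saved:      # one (codes, centers) pair saved per iteration
        u_hat += centers[codes]
    return u_hat


def rework_cost(failure_iter, full_ckpt_every):
    """Iterations that must be recomputed when only periodic full checkpoints exist."""
    return failure_iter - (failure_iter // full_ckpt_every) * full_ckpt_every


print(rework_cost(failure_iter=7, full_ckpt_every=10))   # 7 iterations redone by the default scheme
```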