On Efficient Constructions of Checkpoints
Authors: Yu Chen, Zhenming Liu, Bin Ren, Xin Jin
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments show that LC-Checkpoint achieves a compression rate of up to 28× and a recovery speedup of up to 5.77× over a state-of-the-art algorithm (SCAR). This section evaluates LC-Checkpoint on four typical ML applications with three benchmark datasets, and compares it with previous efforts (SCAR (Qiao et al., 2018b) and a TOPN mechanism, as mentioned in Section 2) on recovery (rework) cost, compression ratio, and execution overhead, demonstrating the superiority of LC-Checkpoint. |
| Researcher Affiliation | Academia | Yu Chen 1 Zhenming Liu 1 Bin Ren 1 Xin Jin 2 1William & Mary, Williamsburg, Virginia, USA 2Johns Hopkins University, Baltimore, Maryland, USA. |
| Pseudocode | Yes | Algorithm 1 LC-CHECKPOINT-BASED SGD. Input: u*, u_0, η. 1: Initialize û_0 = u_0. 2: for t = 1 to T do. 3: Update model state: u_t = u* + η(u_{t-1} - u*) + ϵ. 4: Compute distance: δ_t = u_t - û_{t-1}. 5: Quantize δ_t: δ̂_t = QUANTIZE(δ_t). 6: Compress δ̂_t by Huffman coding and save to disk. 7: Update checkpoint state: û_t = û_{t-1} + δ̂_t. 8: end for. Output: û_T, { δ̂_t | t ∈ [T] }. (A minimal Python sketch of this loop appears after the table.) |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | ML Applications and Datasets: LC-Checkpoint is evaluated on four typical ML applications: Multinomial Logistic Regression (MLR), LeNet-5 (LeNet) (LeCun et al., 1998), AlexNet (Krizhevsky et al., 2012) and Matrix Factorization (MF). The first three applications are trained on the MNIST (LeCun et al., 1998) and Fashion-MNIST (Xiao et al., 2017) datasets. The last one, MF, is trained on Jester (Goldberg et al., 2001) and MovieLens 10M (Harper & Konstan, 2015). |
| Dataset Splits | No | The paper does not explicitly provide details about training/validation/test dataset splits. It mentions 'Two checkpoint sizes are tested: 5% and 10% of the full checkpoint size' and 'repeating each trial 50 times' for evaluating rework cost, but this is about checkpoint size/evaluation methodology, not data splits. |
| Hardware Specification | Yes | Our experiments are conducted on a multi-core server with an Intel Xeon Gold 6138 Skylake CPU with 40 cores, each running at 2.0 GHz, and 192 GB DDR4 memory. The training is performed on a Tesla P100 GPU with 16GB High-bandwidth Memory (HBM). |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) required to replicate the experiment. |
| Experiment Setup | Yes | To address this issue, LC-Checkpoint employs 2-bit and 3-bit priority promotion to control its checkpoint size at 5% and 10% of the full checkpoint size. The default approach creates a full checkpoint every 10 iterations. A failure is triggered at the 7th iteration. (See the recovery sketch after the table.) |
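
Algorithm 1 in the pseudocode row compresses each iteration's update delta by quantizing it into a handful of buckets and Huffman-coding the bucket indices. Below is a minimal, self-contained Python sketch of one such checkpoint step. It is not the paper's implementation: a toy sign/magnitude-quantile quantizer stands in for the paper's exponent-based priority-promotion quantizer, and the helper names (`quantize`, `huffman_encode`, `lc_checkpoint_step`) are ours.

```python
# Minimal sketch of an LC-Checkpoint-style delta checkpoint step (Algorithm 1 above).
# Assumptions (not from the paper): a toy 2-bit sign/magnitude quantizer replaces the
# paper's priority-promotion quantizer; the Huffman coder is a generic heapq-based one.
import heapq
from collections import Counter

import numpy as np


def quantize(delta, n_bits=2):
    """Map each delta entry to one of 2**n_bits buckets and return (codes, centers)."""
    n_buckets = 2 ** n_bits
    # Bucket by magnitude quantile and sign -- a stand-in for exponent-based bucketing.
    edges = np.quantile(np.abs(delta), np.linspace(0.0, 1.0, n_buckets // 2 + 1))
    mag_bucket = np.clip(np.searchsorted(edges, np.abs(delta)) - 1, 0, n_buckets // 2 - 1)
    codes = mag_bucket * 2 + (delta < 0)                      # interleave the sign bit
    centers = np.array([delta[codes == c].mean() if np.any(codes == c) else 0.0
                        for c in range(n_buckets)])           # one representative per bucket
    return codes.astype(np.uint8), centers


def huffman_encode(codes):
    """Return a bitstring and code table for the bucket indices."""
    freq = Counter(codes.tolist())
    heap = [[w, i, {sym: ""}] for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                                        # degenerate single-symbol case
        heap[0][2] = {next(iter(freq)): "0"}
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        merged = {s: "0" + b for s, b in lo[2].items()}
        merged.update({s: "1" + b for s, b in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], merged])
    table = heap[0][2]
    return "".join(table[c] for c in codes.tolist()), table


def lc_checkpoint_step(u_t, u_hat_prev, n_bits=2):
    """One checkpoint step: quantize the delta, Huffman-code it, update the checkpoint state."""
    delta = u_t - u_hat_prev                                  # line 4 of Algorithm 1
    codes, centers = quantize(delta, n_bits)                  # line 5
    bits, table = huffman_encode(codes)                       # line 6 (write `bits` to disk)
    u_hat = u_hat_prev + centers[codes]                       # line 7
    return u_hat, bits, table, centers
```

In this sketch, the bitstring `bits` together with `table` and `centers` is what would be written to disk each iteration; comparing `len(bits)` with `delta.nbytes * 8` gives a rough per-step compression rate.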
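The experiment-setup row contrasts the default scheme (a full checkpoint every 10 iterations, so a failure at the 7th iteration forces 7 iterations of rework) with LC-Checkpoint, which can rebuild its checkpoint state by replaying the saved per-iteration deltas. A hypothetical sketch of that comparison follows, reusing the (codes, centers) pairs from the block above as a simplified in-memory stand-in for the Huffman-decoded deltas; the function names and the `saved` format are our own.

```python
# Hypothetical illustration of the rework-cost comparison in the experiment-setup row.
import numpy as np


def recover_from_deltas(u0, saved):
    """Rebuild the checkpoint state û_t by accumulating dequantized deltas (Algorithm 1, line 7)."""
    u_hat = u0.copy()
    for codes, centers in saved:      # one (codes, centers) pair saved per iteration
        u_hat += centers[codes]
    return u_hat


def rework_cost(failure_iter, full_ckpt_every):
    """Iterations that must be recomputed when only periodic full checkpoints exist."""
    return failure_iter - (failure_iter // full_ckpt_every) * full_ckpt_every


print(rework_cost(failure_iter=7, full_ckpt_every=10))   # 7 iterations redone by the default scheme
```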