TSPipe: Learn from Teacher Faster with Pipelines
Authors: Hwijoon Lim, Yechan Kim, Sukmin Yun, Jinwoo Shin, Dongsu Han
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficiency of TSPipe by training various KD and SSL schemes. For example, when we train MoCo-v3 under multiple-sized ViT architectures with 16 GPUs, TSPipe achieves up to 12.15x higher training throughput compared to inter-layer MP (Shoeybi et al., 2019). When we perform KD from ViT networks to ResNet with 8 GPUs, TSPipe achieves up to 4.68x higher training throughput over inter-layer MP. We also evaluate the learned representation quality for SSL where we adopt asymmetric parameter update. |
| Researcher Affiliation | Academia | School of Electrical Engineering, KAIST, Daejeon, Republic of Korea; Kim Jaechul Graduate School of AI, KAIST, Daejeon, Republic of Korea. Correspondence to: Dongsu Han <dhan.ee@kaist.ac.kr>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code is available at https://github.com/kaist-ina/TSPipe. |
| Open Datasets | Yes | We utilize four image datasets: STL10 (Coates et al., 2011), CIFAR10/CIFAR100 (Krizhevsky et al., 2009), and ImageNet100 (Russakovsky et al., 2015). |
| Dataset Splits | No | After training the BYOL model with ResNet-18 architecture over 200 epochs using ImageNet100 (pre-training), we further train the linear classifier over 90 epochs to evaluate the test accuracy. We utilize the linear evaluation protocol described in Grill et al. (2020); Chen et al. (2020); Oord et al. (2018). (A minimal sketch of this protocol appears below the table.) |
| Hardware Specification | Yes | We evaluate TSPipe on a DGX-1 machine, which features 8 V100 GPUs (32GB memory) with 2 NVLink connections between adjacent GPUs (NVIDIA, 2017). To further evaluate TSPipe with 16 GPUs, we use two Azure ND40rs_v2 VMs (8 V100 GPUs each) with GPUDirect RDMA for faster inter-node communication. |
| Software Dependencies | No | We implement TSPipe on PyTorch (Paszke et al., 2019). We use a multi-process design where we implement CPU-CPU communication with PyTorch RPC and GPU-GPU communication via NCCL (NVIDIA, 2021). (A sketch of this communication setup appears below the table.) |
| Experiment Setup | Yes | For BYOL, we use four different sizes of ResNet (He et al., 2016) as its backbone architecture. The LARS (You et al., 2017) optimizer is used with a base learning rate of lr = 0.2, linearly scaled w.r.t. the batch size (lr × BatchSize/256) (Goyal et al., 2017). We apply cosine-annealing learning rate scheduling (Loshchilov & Hutter, 2016) with a weight decay of 1.5×10⁻⁶. For the momentum constant τ, cosine annealing is applied from τ = 0.996 to 1. We train for 200 epochs each with 10 warm-up epochs. For MoCo-v3... the AdamW (Loshchilov & Hutter, 2017) optimizer is used with a linearly scaled learning rate, lr = 1.5×10⁻⁴. We apply a weight decay of 0.1 and a momentum of 0.99 with cosine annealing, and train for 100 epochs with 10 warm-up epochs. To avoid out-of-memory, we vary the batch size between 128 and 2048. (A sketch of these hyperparameters appears below the table.) |
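
The linear-evaluation protocol quoted in the Dataset Splits row can be summarized as follows. This is a minimal sketch assuming a standard frozen-backbone setup; names such as `pretrained_backbone`, `feature_dim`, and `num_classes` are illustrative assumptions and are not taken from the TSPipe codebase.

```python
# Minimal sketch of linear evaluation: freeze the pre-trained backbone and
# train only a linear classifier on top (Grill et al., 2020).
import torch.nn as nn

def build_linear_eval_model(pretrained_backbone: nn.Module,
                            feature_dim: int = 512,    # ResNet-18 feature size
                            num_classes: int = 100     # ImageNet100
                            ) -> nn.Module:
    """Freeze the backbone and attach a trainable linear classifier."""
    for p in pretrained_backbone.parameters():
        p.requires_grad = False   # backbone stays fixed during linear evaluation
    classifier = nn.Linear(feature_dim, num_classes)
    return nn.Sequential(pretrained_backbone, classifier)
```

Only the classifier's parameters receive gradient updates; per the quoted protocol it is trained for 90 epochs after the 200-epoch pre-training run.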
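
The Software Dependencies row mentions a multi-process design with PyTorch RPC for CPU-CPU communication and NCCL for GPU-GPU communication. The sketch below shows one common way to initialize both in the same process; it assumes `torchrun`-style environment variables and a separate rendezvous port for RPC, and is not the actual TSPipe implementation.

```python
# Sketch: combine a NCCL process group (GPU-GPU tensor transfers) with
# PyTorch RPC (CPU-side control messages) in one worker process.
import os
import torch
import torch.distributed as dist
import torch.distributed.rpc as rpc

def init_communication():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # NCCL process group for GPU-GPU communication (activations, gradients).
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # RPC for CPU-CPU control-plane messages, using a separate rendezvous
    # port (29501 is an arbitrary assumption) to avoid clashing with NCCL's.
    options = rpc.TensorPipeRpcBackendOptions(
        init_method=f"tcp://{os.environ['MASTER_ADDR']}:29501")
    rpc.init_rpc(name=f"worker{rank}", rank=rank, world_size=world_size,
                 rpc_backend_options=options)

def shutdown_communication():
    rpc.shutdown()
    dist.destroy_process_group()
```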
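
The Experiment Setup row lists the optimizer and schedule hyperparameters. The sketch below restates the MoCo-v3 AdamW settings and the cosine schedules for the learning rate and the momentum constant τ quoted above; the warm-up formulation and function names are illustrative assumptions, not code from the paper, and the LARS optimizer used for BYOL is omitted because it is not part of core PyTorch.

```python
# Sketch of the quoted hyperparameters: linearly scaled AdamW learning rate,
# linear warm-up + cosine annealing, and a BYOL-style cosine schedule for tau.
import math
import torch

def build_optimizer(model: torch.nn.Module, batch_size: int):
    base_lr = 1.5e-4                      # base learning rate from the paper
    lr = base_lr * batch_size / 256       # linear LR scaling (Goyal et al., 2017)
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)

def lr_factor(epoch: int, total_epochs: int = 100, warmup_epochs: int = 10) -> float:
    """LR multiplier: linear warm-up followed by cosine annealing."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def momentum_tau(epoch: int, total_epochs: int, tau_base: float = 0.996) -> float:
    """Cosine schedule that anneals the EMA momentum tau from 0.996 toward 1."""
    return 1.0 - (1.0 - tau_base) * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

The `lr_factor` function is written so it can be plugged into `torch.optim.lr_scheduler.LambdaLR` as a per-epoch multiplier.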