SmartFRZ: An Efficient Training Framework using Attention-Based Layer Freezing
Authors: Sheng Li, Geng Yuan, Yue Dai, Youtao Zhang, Yanzhi Wang, Xulong Tang
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that SmartFRZ effectively reduces the amount of computation in training and achieves significant training acceleration, and outperforms the state-of-the-art layer freezing approaches. In CV domain, we use three representative CNN models ResNet50, VGG11, and MobileNetV2 (Sandler et al., 2018), and a vision transformer model DeiT-T (Touvron et al., 2021). And we use three widely-used datasets ImageNet (Deng et al., 2009), CIFAR-10, and CIFAR-100. |
| Researcher Affiliation | Academia | University of Pittsburgh; Northeastern University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing open-source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | In CV domain, we use three representative CNN models ResNet50, VGG11, and MobileNetV2 (Sandler et al., 2018), and a vision transformer model DeiT-T (Touvron et al., 2021). And we use three widely-used datasets ImageNet (Deng et al., 2009), CIFAR-10, and CIFAR-100. In NLP domain, we fine-tune the pre-trained BERT-base model (Kenton & Toutanova, 2019) using two datasets MRPC (Dolan & Brockett, 2005) and CoLA (Warstadt et al., 2019) in GLUE benchmark (Wang et al., 2019). |
| Dataset Splits | No | The paper does not explicitly provide specific training/validation/test dataset splits (e.g., percentages or sample counts). It mentions "The training data is divided into batches with a size of 32", which refers to batch size, not data splitting percentages. |
| Hardware Specification | Yes | We conduct our experiments on two servers: i) a server with 8 NVIDIA RTX 2080Ti GPUs is used for the experiments on ImageNet dataset and ii) an NVIDIA Tesla P100 GPU server is used in all other experiments. |
| Software Dependencies | No | The paper mentions using "native training frameworks such as PyTorch/TensorFlow" and the "SGD optimizer with momentum" but does not provide specific version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | Both the predictor and the target networks are trained using the SGD optimizer with momentum. The training data is divided into batches with a size of 32. In CNN models, we freeze the BN layer together with its corresponding CONV layer to avoid unnecessary computation costs in back-propagation. The attention-based lightweight predictor is trained once on ImageNet using ResNet50 and then used in different models and datasets. The attention window size for the predictor is 30. And we tailor all the layers into the size of 1024 by random sampling to fit into the generic predictor. All the accuracy results in this paper are the average of 5 runs using different random seeds. And the overhead introduced by predictor is included in the results. Then the CNN models and ViT model are trained for 10 epochs and 100 epochs to converge, respectively, using a cosine annealing learning rate scheduler according to the training epochs. |
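
The Experiment Setup row above describes SGD with momentum, a batch size of 32, a cosine annealing learning rate schedule, and freezing each CONV layer together with its paired BN layer. Below is a minimal PyTorch sketch of what such a configuration could look like; the learning rate, momentum value, the `freeze_conv_bn` helper, the choice of CIFAR-10, and which layer pair gets frozen are all illustrative assumptions, not details taken from the paper (in SmartFRZ the freezing decisions come from the attention-based predictor).

```python
# Minimal sketch of the reported training configuration, assuming PyTorch.
# lr, momentum, CIFAR-10, and the frozen conv1/bn1 pair are illustrative only.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

def freeze_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> None:
    """Freeze a CONV layer together with its paired BN layer so neither
    receives gradients and the BN running statistics stop updating."""
    for p in list(conv.parameters()) + list(bn.parameters()):
        p.requires_grad = False
    bn.eval()

model = models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 10)  # 10 classes for CIFAR-10

# Example freeze of the first CONV/BN pair (in SmartFRZ, the attention-based
# predictor would decide which layers to freeze and when).
freeze_conv_bn(model.conv1, model.bn1)

# SGD with momentum, as stated in the paper; only trainable parameters are
# handed to the optimizer so frozen layers add no optimizer-state overhead.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=0.01, momentum=0.9,  # assumed values; not given in the paper
)

# Cosine annealing over the stated 10-epoch CNN training budget.
epochs = 10
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

# Batch size 32, as stated in the paper.
train_set = datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=transforms.Compose([transforms.Resize(224),
                                  transforms.ToTensor()]),
)
loader = DataLoader(train_set, batch_size=32, shuffle=True)

criterion = nn.CrossEntropyLoss()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

for epoch in range(epochs):
    model.train()
    model.bn1.eval()  # model.train() resets BN modes; keep the frozen BN fixed
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```

Passing only `requires_grad` parameters to the optimizer mirrors the motivation quoted in the table: frozen CONV/BN pairs drop out of back-propagation, which is where the reported training-time savings come from.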