SmartFRZ: An Efficient Training Framework using Attention-Based Layer Freezing
Authors: Sheng Li, Geng Yuan, Yue Dai, Youtao Zhang, Yanzhi Wang, Xulong Tang
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that SmartFRZ effectively reduces the amount of computation in training and achieves significant training acceleration, and outperforms the state-of-the-art layer freezing approaches. In CV domain, we use three representative CNN models ResNet50, VGG11, and MobileNetV2 (Sandler et al., 2018), and a vision transformer model DeiT-T (Touvron et al., 2021). And we use three widely-used datasets ImageNet (Deng et al., 2009), CIFAR-10, and CIFAR-100. |
| Researcher Affiliation | Academia | University of Pittsburgh; Northeastern University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing open-source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | In CV domain, we use three representative CNN models ResNet50, VGG11, and MobileNetV2 (Sandler et al., 2018), and a vision transformer model DeiT-T (Touvron et al., 2021). And we use three widely-used datasets ImageNet (Deng et al., 2009), CIFAR-10, and CIFAR-100. In NLP domain, we fine-tune the pre-trained BERT-base model (Kenton & Toutanova, 2019) using two datasets MRPC (Dolan & Brockett, 2005) and CoLA (Warstadt et al., 2019) in GLUE benchmark (Wang et al., 2019). |
| Dataset Splits | No | The paper does not explicitly provide specific training/validation/test dataset splits (e.g., percentages or sample counts). It mentions "The training data is divided into batches with a size of 32", which refers to batch size, not data splitting percentages. |
| Hardware Specification | Yes | We conduct our experiments on two servers: i) a server with 8 NVIDIA RTX 2080Ti GPUs is used for the experiments on ImageNet dataset and ii) an NVIDIA Tesla P100 GPU server is used in all other experiments. |
| Software Dependencies | No | The paper mentions using "native training frameworks such as PyTorch/TensorFlow" and the "SGD optimizer with momentum" but does not provide specific version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | Both the predictor and the target networks are trained using the SGD optimizer with momentum. The training data is divided into batches with a size of 32. In CNN models, we freeze the BN layer together with its corresponding CONV layer to avoid unnecessary computation costs in back-propagation. The attention-based lightweight predictor is trained once on ImageNet using ResNet50 and then used in different models and datasets. The attention window size for the predictor is 30. And we tailor all the layers into the size of 1024 by random sampling to fit into the generic predictor. All the accuracy results in this paper are the average of 5 runs using different random seeds. And the overhead introduced by predictor is included in the results. Then the CNN models and ViT model are trained for 10 epochs and 100 epochs to converge, respectively, using a cosine annealing learning rate scheduler according to the training epochs. |
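
The Experiment Setup row above describes SGD with momentum, a batch size of 32, a cosine annealing learning rate schedule, and freezing each CONV layer together with its paired BN layer. Below is a minimal PyTorch sketch of what such a configuration could look like; the learning rate, momentum value, the `freeze_conv_bn` helper, the choice of CIFAR-10, and which layer pair gets frozen are all illustrative assumptions, not details taken from the paper (in SmartFRZ the freezing decisions come from the attention-based predictor).

```python
# Minimal sketch of the reported training configuration, assuming PyTorch.
# lr, momentum, CIFAR-10, and the frozen conv1/bn1 pair are illustrative only.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

def freeze_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> None:
    """Freeze a CONV layer together with its paired BN layer so neither
    receives gradients and the BN running statistics stop updating."""
    for p in list(conv.parameters()) + list(bn.parameters()):
        p.requires_grad = False
    bn.eval()

model = models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 10)  # 10 classes for CIFAR-10

# Example freeze of the first CONV/BN pair (in SmartFRZ, the attention-based
# predictor would decide which layers to freeze and when).
freeze_conv_bn(model.conv1, model.bn1)

# SGD with momentum, as stated in the paper; only trainable parameters are
# handed to the optimizer so frozen layers add no optimizer-state overhead.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=0.01, momentum=0.9,  # assumed values; not given in the paper
)

# Cosine annealing over the stated 10-epoch CNN training budget.
epochs = 10
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

# Batch size 32, as stated in the paper.
train_set = datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=transforms.Compose([transforms.Resize(224),
                                  transforms.ToTensor()]),
)
loader = DataLoader(train_set, batch_size=32, shuffle=True)

criterion = nn.CrossEntropyLoss()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

for epoch in range(epochs):
    model.train()
    model.bn1.eval()  # model.train() resets BN modes; keep the frozen BN fixed
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```

Passing only `requires_grad` parameters to the optimizer mirrors the motivation quoted in the table: frozen CONV/BN pairs drop out of back-propagation, which is where the reported training-time savings come from.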