Sparse Token Transformer with Attention Back Tracking

Authors: Heejun Lee, Minki Kang, Youngwan Lee, Sung Ju Hwang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally validate the effectiveness of the method on both NLP and CV benchmarks, using Transformer architectures for both domains, and the results show that the proposed attention back-tracking allows the model to better retain the full model's performance even at high sparsity rates, significantly outperforming all baselines.
Researcher Affiliation | Collaboration | Heejun Lee (1,2), Minki Kang (1,3), Youngwan Lee (1,4), Sung Ju Hwang (1); affiliations: 1 KAIST, 2 DeepAuto.ai, 3 AITRICS, 4 ETRI; emails: {ainl, zzxc1133}@kaist.ac.kr, yw.lee@etri.ac.kr, sjhwang82@kaist.ac.kr
Pseudocode | Yes | Algorithm 1: Update token mask from the output token indices (a hedged sketch follows this table).
Open Source Code | No | We will introduce code construction and data collection in this section for reproducibility. Model Implementation: first we construct our model with PyTorch (Paszke et al., 2019) and Huggingface (Wolf et al., 2020), modifying bert-base-uncased from Huggingface Models (Wolf et al., 2020). ApproxNet: we create ApproxNet (Section 3.2, trainer/glue_base.py) with the pretrained models, and implement Algorithm 1 on top of bert-base-uncased (Devlin et al., 2019) (models/sparse_token.py). The paper describes implementation details and specific file paths but provides no direct link or explicit statement of code release. (A hedged model-loading sketch follows this table.)
Open Datasets | Yes | We use nine datasets from the GLUE (Wang et al., 2019) benchmark for text classification and use BERT-base (Devlin et al., 2019) as the base model. For image classification, we validate it on the ImageNet-1K (Deng et al., 2009) benchmark with DeiT (Touvron et al., 2021).
Dataset Splits | Yes | Same evidence as above (the GLUE and ImageNet-1K benchmarks); in addition, the 'LTP (Best valid.)' entry in the Figure 3 legend indicates validation-set usage.
Hardware Specification | No | The paper discusses computational costs in terms of FLOPs but does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used for conducting the experiments.
Software Dependencies | No | "First we construct our model with PyTorch (Paszke et al., 2019) and Huggingface (Wolf et al., 2020)." This mentions software names but lacks specific version numbers.
Experiment Setup | Yes | We match the training settings of STTABT and LTP, then follow the hyperparameter search strategies described in Kim et al. (2022). For instance, for tasks in the GLUE benchmark, we observe that θ_l does not change much during training; therefore, we use a low λ_p of around 1e-3 in GLUE task training. On the other hand, we observe that the model with concrete masking tends to fail to keep the target token retention ratio in the image classification task with ViT; in this case, we use a high λ_mask of around 100 to keep the token retention ratio close to the desired target value. (A hedged hyperparameter sketch follows this table.)
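
Regarding the Pseudocode row: Algorithm 1 itself is not reproduced on this page, so the following is only a minimal PyTorch sketch of what updating a token mask from output token indices can look like in general. The function name update_token_mask, the tensor shapes, and the rule of always retaining the [CLS] token are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def update_token_mask(keep_indices: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Turn per-sample kept-token indices of shape (batch, k) into a boolean
    keep-mask of shape (batch, seq_len)."""
    batch_size = keep_indices.size(0)
    mask = torch.zeros(batch_size, seq_len, dtype=torch.bool, device=keep_indices.device)
    rows = torch.arange(batch_size, device=keep_indices.device).unsqueeze(1)
    mask[rows, keep_indices] = True  # mark the selected (back-tracked) tokens as kept
    mask[:, 0] = True                # assumption: always retain the [CLS] token
    return mask

# Example: keep tokens {2, 5, 7} and {1, 3, 9} in two length-12 sequences.
keep = torch.tensor([[2, 5, 7], [1, 3, 9]])
print(update_token_mask(keep, seq_len=12).int())
```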
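
Regarding the Open Source Code row: since no code link is provided, the snippet below sketches only the described starting point, i.e. loading the pretrained bert-base-uncased checkpoint with Huggingface Transformers; it is not the authors' sparse-token implementation (models/sparse_token.py) or ApproxNet trainer.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the pretrained backbone the paper says it modifies; any sparse-token
# logic (token masking, ApproxNet) would be layered on top of this model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("Sparse token transformers drop uninformative tokens.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```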
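
Regarding the Experiment Setup row: the reported loss weights can be summarized as a small configuration. Only the values λ_p ≈ 1e-3 (GLUE/BERT) and λ_mask ≈ 100 (ImageNet-1K/ViT) come from the paper; the dictionary layout and the weighted-sum loss form below are illustrative assumptions.

```python
# Reported weights, organized as an illustrative config (structure is assumed).
SPARSITY_WEIGHTS = {
    "glue_bert": {"lambda_p": 1e-3},         # low weight reported for GLUE training
    "imagenet_vit": {"lambda_mask": 100.0},  # high weight reported for ImageNet-1K/ViT training
}

def combined_loss(task_loss, prune_loss=0.0, mask_loss=0.0, lambda_p=0.0, lambda_mask=0.0):
    # Assumed form: task objective plus weighted sparsity/mask regularizers.
    return task_loss + lambda_p * prune_loss + lambda_mask * mask_loss
```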