Sparse Token Transformer with Attention Back Tracking
Authors: Heejun Lee, Minki Kang, Youngwan Lee, Sung Ju Hwang
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally validate the effectiveness of the method on both NLP and CV benchmarks, using Transformer architectures for both domains, and the results show that the proposed attention back-tracking allows the model to better retain the full model's performance even at high sparsity rates, significantly outperforming all baselines. |
| Researcher Affiliation | Collaboration | Heejun Lee (1,2), Minki Kang (1,3), Youngwan Lee (1,4), Sung Ju Hwang (1); 1: KAIST, 2: DeepAuto.ai, 3: AITRICS, 4: ETRI. Contact: {ainl, zzxc1133}@kaist.ac.kr, yw.lee@etri.ac.kr, sjhwang82@kaist.ac.kr |
| Pseudocode | Yes | Algorithm 1: Update token mask from the output token indices (an illustrative sketch of such a mask-update step is given below the table) |
| Open Source Code | No | We will introduce code construction and data collection in this section for reproducibility. Model Implementation. First we construct our model with PyTorch (Paszke et al., 2019) and Huggingface (Wolf et al., 2020). We modify bert-base-uncased from Huggingface Models (Wolf et al., 2020). ApproxNet. We create ApproxNet (Sec. 3.2, trainer/glue_base.py) with the pretrained models. We implement Algorithm 1 on top of bert-base-uncased (Devlin et al., 2019) (models/sparse_token.py). The paper describes code implementation details and specific file paths but does not provide a direct link or an explicit statement of code release. A minimal sketch of loading the reported backbone is given below the table. |
| Open Datasets | Yes | We use nine datasets from the GLUE (Wang et al., 2019) benchmark for text classification and use BERT-base (Devlin et al., 2019) as the base model. For image classification, we validate it on the ImageNet-1K (Deng et al., 2009) benchmark with DeiT (Touvron et al., 2021). |
| Dataset Splits | Yes | We use nine datasets from the GLUE (Wang et al., 2019) benchmark for text classification and use BERT-base (Devlin et al., 2019) as the base model. For image classification, we validate it on the ImageNet-1K (Deng et al., 2009) benchmark with DeiT (Touvron et al., 2021). Also, 'LTP (Best valid.)' in the Figure 3 legend. |
| Hardware Specification | No | The paper discusses computational costs in terms of FLOPs but does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used for conducting the experiments. |
| Software Dependencies | No | First we construct our model with PyTorch (Paszke et al., 2019) and Huggingface (Wolf et al., 2020). This mentions software names but lacks specific version numbers. |
| Experiment Setup | Yes | We match the training settings of STTABT and LTP, then follow the hyperparameter search strategies described in (Kim et al., 2022). For instance, for tasks in the GLUE benchmark, we observe that θ_l does not change much during training. Therefore, we use a low λ_p around 1e-3 in GLUE task training. On the other hand, we observe that the model with concrete masking tends to fail to keep the target token retention ratio in the image classification task with ViT. In this case, we use a high λ_mask around 100 to keep the token retention ratio close to the desired target value. An illustrative loss-weighting sketch using these coefficients follows the table. |
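
The paper's Algorithm 1 updates a token mask from the output token indices via attention back-tracking. The sketch below is an illustration only, not the authors' exact procedure: it propagates a keep-mask backwards through per-layer attention maps, retaining the tokens that the surviving tokens attend to most. The function name, the head-averaging, and the `keep_ratio` knob are assumptions.

```python
import torch

def backtrack_token_mask(attn_probs, output_indices, keep_ratio=0.5):
    """Illustrative back-tracking of a token keep-mask through attention maps.

    attn_probs: list of [batch, heads, seq, seq] attention tensors, one per layer
    output_indices: LongTensor [batch, k] of token indices kept at the output
    keep_ratio: assumed fraction of tokens to retain at each earlier layer
    """
    batch, _, seq, _ = attn_probs[-1].shape
    rows = torch.arange(batch).unsqueeze(1)  # [batch, 1] row index helper

    # Boolean mask of tokens currently kept, seeded from the output indices.
    mask = torch.zeros(batch, seq, dtype=torch.bool, device=attn_probs[-1].device)
    mask[rows, output_indices] = True

    masks_per_layer = [mask]
    for attn in reversed(attn_probs):
        # Average attention over heads: [batch, seq(query), seq(key)].
        attn_mean = attn.mean(dim=1)
        # Attention mass flowing from the currently kept queries into each key token.
        score = (attn_mean * mask.unsqueeze(-1).float()).sum(dim=1)  # [batch, seq]
        k = max(1, int(keep_ratio * seq))
        topk = score.topk(k, dim=-1).indices
        mask = torch.zeros_like(mask)
        mask[rows, topk] = True
        masks_per_layer.append(mask)

    # Masks ordered from the first layer's input up to the output layer.
    return list(reversed(masks_per_layer))
```

With random attention tensors of shape [batch, heads, seq, seq], the function returns one boolean keep-mask per layer boundary, mirroring the idea that a token is kept only if the output tokens (directly or transitively) attend to it.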
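
For the reported implementation starting point (bert-base-uncased from Huggingface on top of PyTorch), a minimal loading sketch is shown below. The classification head size and the example input are placeholders; the paper's actual modifications (ApproxNet, models/sparse_token.py) are not reproduced here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pretrained backbone the paper reports modifying.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # e.g. a binary GLUE task such as SST-2

inputs = tokenizer("a sparse token transformer example", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```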
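
The experiment-setup quote gives only the magnitudes of the coefficients (λ_p around 1e-3 for GLUE, λ_mask around 100 for ViT on ImageNet-1K). The snippet below is a hypothetical illustration of how such coefficients could weight auxiliary terms against the task loss; the term names and the squared-error form of the mask penalty are assumptions, not the paper's definitions.

```python
# Hypothetical loss composition; only the coefficient magnitudes come from the paper.
lambda_p = 1e-3      # reported magnitude of the pruning-threshold regularizer weight (GLUE)
lambda_mask = 100.0  # reported magnitude of the token-retention-ratio weight (ViT)

def total_loss(task_loss, threshold_reg, retention_ratio, target_ratio):
    # Assumed squared penalty pulling the retention ratio toward its target.
    mask_penalty = (retention_ratio - target_ratio) ** 2
    return task_loss + lambda_p * threshold_reg + lambda_mask * mask_penalty
```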