Faster Depth-Adaptive Transformers
Authors: Yijin Liu, Fandong Meng, Jie Zhou, Yufeng Chen, Jinan Xu
AAAI 2021, pp. 13424-13432 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on the text classification task with 24 datasets in various sizes and domains. Results confirm that our approaches can speed up the vanilla Transformer (up to 7x) while preserving high accuracy. |
| Researcher Affiliation | Collaboration | 1 Beijing Jiaotong University, China; 2 Pattern Recognition Center, WeChat AI, Tencent Inc., China. {yijinliu, fandongmeng, withtomzhou}@tencent.com; {chenyf, jaxu}@bjtu.edu.cn |
| Pseudocode | No | The paper describes procedures and equations but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | Codes will appear at https://github.com/Adaxry/Adaptive Transformer |
| Open Datasets | Yes | We conduct extensive experiments on the 24 popular benchmarks collected from diverse domains (e.g., topic, sentiment) ranging from modestly sized to large-scale. The statistics of these datasets are listed in Table 1. Examples include TREC (Li and Roth 2002), AG's News (Zhang, Zhao, and LeCun 2015), and IMDB (Maas et al. 2011). |
| Dataset Splits | Yes | The statistics of these datasets are listed in Table 1. CV refers to 5-fold cross-validation. Table 1 includes 'Train Sample' and 'Test Sample' columns. |
| Hardware Specification | Yes | Speed is the number of samples calculated in ten seconds on one Tesla P40 GPU with a batch size of 1 (a minimal timing sketch follows the table). |
| Software Dependencies | No | The paper mentions software components like BERT and Adam optimizer, but does not provide specific version numbers for any libraries or frameworks (e.g., 'Python 3.x', 'PyTorch 1.y', or specific BERT version). |
| Experiment Setup | Yes | Dropout (Srivastava et al. 2014) is applied to word embeddings, residual connections, and attention scores with a rate of 0.1. Models are optimized by the Adam optimizer (Kingma and Ba 2014) with gradient clipping of 5 (Pascanu, Mikolov, and Bengio 2013). BERT-base is used to initialize the Transformer encoder. Long sentences exceeding 512 words are clipped. The penalty factor λ in the reconstruction-loss-based approach is set to 0.1 (a configuration sketch follows the table). |
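The throughput metric quoted in the Hardware Specification row (samples processed in ten seconds on a single GPU at batch size 1) amounts to a fixed-window timing loop. Below is a minimal sketch, assuming a PyTorch model and an iterable of single-sample input tensors; `model` and `dataset` are hypothetical placeholders, not artifacts released with the paper.

```python
import time

import torch


def samples_per_window(model, dataset, window_seconds=10.0, device="cuda"):
    """Count how many samples the model processes in a fixed time window
    at batch size 1, mirroring the throughput metric quoted above.
    `model` and `dataset` are illustrative placeholders."""
    model = model.to(device).eval()
    processed = 0
    start = time.perf_counter()
    with torch.no_grad():
        for sample in dataset:
            if time.perf_counter() - start >= window_seconds:
                break
            model(sample.unsqueeze(0).to(device))  # batch of a single sample
            processed += 1
    return processed
```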
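The Experiment Setup row lists concrete hyperparameters: dropout 0.1, Adam with gradient clipping of 5, BERT-base initialization, 512-token truncation, and a penalty factor λ = 0.1 on the reconstruction loss. The sketch below shows one way these settings could be wired together with PyTorch and Hugging Face Transformers; the learning rate, the use of norm-based clipping, and the `optimize_step` helper are assumptions for illustration, not details reported in the paper.

```python
import torch
from torch.optim import Adam
from transformers import BertModel

MAX_LEN = 512   # sentences exceeding 512 tokens are clipped
DROPOUT = 0.1   # applied to word embeddings, residual connections, and attention scores
LAMBDA = 0.1    # penalty factor for the reconstruction-loss-based approach

# Encoder initialized from BERT-base weights, with dropout set to 0.1 as described.
encoder = BertModel.from_pretrained(
    "bert-base-uncased",
    hidden_dropout_prob=DROPOUT,
    attention_probs_dropout_prob=DROPOUT,
)
optimizer = Adam(encoder.parameters(), lr=2e-5)  # learning rate not reported; placeholder


def optimize_step(task_loss: torch.Tensor, recon_loss: torch.Tensor) -> torch.Tensor:
    """Combine the task loss with the penalized reconstruction loss,
    clip the gradient norm at 5, and take one Adam step (illustrative helper)."""
    loss = task_loss + LAMBDA * recon_loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(encoder.parameters(), max_norm=5.0)
    optimizer.step()
    return loss.detach()
```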