Faster Depth-Adaptive Transformers

Authors: Yijin Liu, Fandong Meng, Jie Zhou, Yufeng Chen, Jinan Xu (pp. 13424-13432)

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on the text classification task with 24 datasets in various sizes and domains. Results confirm that our approaches can speed up the vanilla Transformer (up to 7x) while preserving high accuracy.
Researcher Affiliation | Collaboration | 1 Beijing Jiaotong University, China; 2 Pattern Recognition Center, WeChat AI, Tencent Inc, China. Emails: {yijinliu, fandongmeng, withtomzhou}@tencent.com; {chenyf, jaxu}@bjtu.edu.cn
Pseudocode | No | The paper describes procedures and equations but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | Codes will appear at https://github.com/Adaxry/Adaptive Transformer
Open Datasets | Yes | We conduct extensive experiments on the 24 popular benchmarks collected from diverse domains (e.g., topic, sentiment) ranging from modestly sized to large-scale. The statistics of these datasets are listed in Table 1. Examples include TREC (Li and Roth 2002), AG's News (Zhang, Zhao, and LeCun 2015), and IMDB (Maas et al. 2011).
Dataset Splits | Yes | The statistics of these datasets are listed in Table 1. CV refers to 5-fold cross-validation. Table 1 includes 'Train Sample' and 'Test Sample' columns.
Hardware Specification | Yes | Speed is the number of samples calculated in ten seconds on one Tesla P40 GPU with a batch size of 1 (see the timing sketch after the table).
Software Dependencies | No | The paper mentions software components such as BERT and the Adam optimizer, but does not provide specific version numbers for any libraries or frameworks (e.g., 'Python 3.x', 'PyTorch 1.y', or a specific BERT release).
Experiment Setup | Yes | Dropout (Srivastava et al. 2014) is applied to word embeddings, residual connections, and attention scores with a rate of 0.1. Models are optimized by the Adam optimizer (Kingma and Ba 2014) with gradient clipping of 5 (Pascanu, Mikolov, and Bengio 2013). BERT-base is used to initialize the Transformer encoder. Long sentences exceeding 512 words are clipped. The penalty factor λ in the reconstruction-loss-based approach is set to 0.1. (See the configuration sketch after the table.)
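
The speed metric quoted under Hardware Specification counts how many single-sample forward passes a model completes within ten seconds. The snippet below is a minimal sketch of one way to reproduce that measurement, assuming a PyTorch model and pre-tokenized inputs; `samples_per_ten_seconds`, `model`, and `samples` are illustrative names, not from the paper.

```python
# Minimal sketch (not the authors' script) of the speed metric described above:
# count how many samples a model processes in ten seconds with batch size 1.
# `model` is any torch.nn.Module classifier; `samples` is a list of input tensors.
import time
import torch

@torch.no_grad()
def samples_per_ten_seconds(model, samples, device="cuda"):
    model.eval().to(device)
    # Warm-up pass so one-time CUDA initialization does not skew the count.
    model(samples[0].unsqueeze(0).to(device))
    if torch.cuda.is_available():
        torch.cuda.synchronize()

    count, start = 0, time.time()
    while time.time() - start < 10.0:
        x = samples[count % len(samples)].unsqueeze(0).to(device)  # batch size 1
        model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait for the GPU so the wall-clock check is honest
        count += 1
    return count
```

Under this harness, a depth-adaptive model that exits early on easy inputs would complete more samples in the same ten-second window, which is how the paper's reported speedups over the vanilla Transformer would show up.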
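
The Experiment Setup row lists concrete hyperparameters (0.1 dropout, Adam, gradient clipping at 5, BERT-base initialization, 512-token truncation, λ = 0.1). The sketch below wires those values into a generic PyTorch / Hugging Face text classifier; the framework choice, learning rate, classification head, and `reconstruction_loss` placeholder are assumptions, not details from the paper.

```python
# Hedged configuration sketch: only the hyperparameters quoted above come from
# the paper; the framework (PyTorch + Hugging Face), learning rate, and the
# classification head are illustrative assumptions.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")  # BERT-base initializes the encoder
# bert-base already applies 0.1 dropout to embeddings, hidden (residual) states,
# and attention probabilities, matching the reported rate.

num_labels = 2          # assumption: binary classification task
lam = 0.1               # penalty factor for the reconstruction-loss-based approach
classifier = torch.nn.Linear(encoder.config.hidden_size, num_labels)
params = list(encoder.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(params, lr=2e-5)  # Adam; the learning rate is an assumed value

def train_step(texts, labels):
    # Sentences longer than 512 word pieces are clipped, as stated in the paper.
    batch = tokenizer(texts, truncation=True, max_length=512,
                      padding=True, return_tensors="pt")
    cls = encoder(**batch).last_hidden_state[:, 0]   # [CLS] representation
    loss = torch.nn.functional.cross_entropy(classifier(cls), labels)
    # loss = loss + lam * reconstruction_loss(...)   # paper-specific term, omitted here
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, 5.0)      # gradient clipping of 5
    optimizer.step()
    return loss.item()
```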