Faster Depth-Adaptive Transformers

Authors: Yijin Liu, Fandong Meng, Jie Zhou, Yufeng Chen, Jinan Xu (pp. 13424-13432)

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on the text classification task with 24 datasets in various sizes and domains. Results confirm that our approaches can speed up the vanilla Transformer (up to 7x) while preserving high accuracy.
Researcher Affiliation | Collaboration | 1 Beijing Jiaotong University, China; 2 Pattern Recognition Center, WeChat AI, Tencent Inc, China. Emails: {yijinliu, fandongmeng, withtomzhou}@tencent.com; {chenyf, jaxu}@bjtu.edu.cn
Pseudocode | No | The paper describes procedures and equations but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | Codes will appear at https://github.com/Adaxry/Adaptive Transformer
Open Datasets | Yes | We conduct extensive experiments on the 24 popular benchmarks collected from diverse domains (e.g., topic, sentiment) ranging from modestly sized to large-scale. The statistics of these datasets are listed in Table 1. Examples include TREC (Li and Roth 2002), AG's News (Zhang, Zhao, and LeCun 2015), and IMDB (Maas et al. 2011).
Dataset Splits | Yes | The statistics of these datasets are listed in Table 1. CV refers to 5-fold cross-validation. Table 1 includes 'Train Sample' and 'Test Sample' columns.
Hardware Specification | Yes | Speed is the number of samples calculated in ten seconds on one Tesla P40 GPU with a batch size of 1 (see the timing sketch after the table).
Software Dependencies | No | The paper mentions software components such as BERT and the Adam optimizer, but does not provide specific version numbers for any libraries or frameworks (e.g., 'Python 3.x', 'PyTorch 1.y', or a specific BERT release).
Experiment Setup | Yes | Dropout (Srivastava et al. 2014) is applied to word embeddings, residual connections, and attention scores with a rate of 0.1. Models are optimized by the Adam optimizer (Kingma and Ba 2014) with gradient clipping of 5 (Pascanu, Mikolov, and Bengio 2013). BERT-base is used to initialize the Transformer encoder. Long sentences exceeding 512 words are clipped. The penalty factor λ in the reconstruction-loss-based approach is set to 0.1. (See the configuration sketch after the table.)
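
The speed metric quoted under Hardware Specification counts how many single-sample forward passes a model completes within ten seconds. The snippet below is a minimal sketch of one way to reproduce that measurement, assuming a PyTorch model and pre-tokenized inputs; `samples_per_ten_seconds`, `model`, and `samples` are illustrative names, not from the paper.

```python
# Minimal sketch (not the authors' script) of the speed metric described above:
# count how many samples a model processes in ten seconds with batch size 1.
# `model` is any torch.nn.Module classifier; `samples` is a list of input tensors.
import time
import torch

@torch.no_grad()
def samples_per_ten_seconds(model, samples, device="cuda"):
    model.eval().to(device)
    # Warm-up pass so one-time CUDA initialization does not skew the count.
    model(samples[0].unsqueeze(0).to(device))
    if torch.cuda.is_available():
        torch.cuda.synchronize()

    count, start = 0, time.time()
    while time.time() - start < 10.0:
        x = samples[count % len(samples)].unsqueeze(0).to(device)  # batch size 1
        model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait for the GPU so the wall-clock check is honest
        count += 1
    return count
```

Under this harness, a depth-adaptive model that exits early on easy inputs would complete more samples in the same ten-second window, which is how the paper's reported speedups over the vanilla Transformer would show up.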
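
The Experiment Setup row lists concrete hyperparameters (0.1 dropout, Adam, gradient clipping at 5, BERT-base initialization, 512-token truncation, λ = 0.1). The sketch below wires those values into a generic PyTorch / Hugging Face text classifier; the framework choice, learning rate, classification head, and `reconstruction_loss` placeholder are assumptions, not details from the paper.

```python
# Hedged configuration sketch: only the hyperparameters quoted above come from
# the paper; the framework (PyTorch + Hugging Face), learning rate, and the
# classification head are illustrative assumptions.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")  # BERT-base initializes the encoder
# bert-base already applies 0.1 dropout to embeddings, hidden (residual) states,
# and attention probabilities, matching the reported rate.

num_labels = 2          # assumption: binary classification task
lam = 0.1               # penalty factor for the reconstruction-loss-based approach
classifier = torch.nn.Linear(encoder.config.hidden_size, num_labels)
params = list(encoder.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(params, lr=2e-5)  # Adam; the learning rate is an assumed value

def train_step(texts, labels):
    # Sentences longer than 512 word pieces are clipped, as stated in the paper.
    batch = tokenizer(texts, truncation=True, max_length=512,
                      padding=True, return_tensors="pt")
    cls = encoder(**batch).last_hidden_state[:, 0]   # [CLS] representation
    loss = torch.nn.functional.cross_entropy(classifier(cls), labels)
    # loss = loss + lam * reconstruction_loss(...)   # paper-specific term, omitted here
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, 5.0)      # gradient clipping of 5
    optimizer.step()
    return loss.item()
```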