Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling

Authors: Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Chengqi Zhang

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct the experiments of Bi-BloSAN and several popular RNN/CNN/SAN-based sequence encoding models on nine benchmark datasets for multiple different NLP tasks. A thorough comparison on nine benchmark datasets demonstrates the advantages of Bi-BloSAN in terms of training speed, inference accuracy and memory consumption. Figure 1 shows that Bi-BloSAN obtains the best accuracy by costing similar training time to DiSAN, and as little memory as Bi-LSTM, Bi-GRU and multi-head attention.
Researcher Affiliation | Academia | Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang & Chengqi Zhang; Centre for Artificial Intelligence, School of Software, University of Technology Sydney; Paul G. Allen School of Computer Science & Engineering, University of Washington
Pseudocode | No | The paper describes the proposed model and its components using textual descriptions and diagrams, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Source code and scripts for experiments are at https://github.com/taoshen58/BiBloSA
Open Datasets | Yes | Stanford Natural Language Inference (Bowman et al., 2015) (SNLI) dataset, which contains standard training/dev/test split of 549,367/9,842/9,824 samples.
Dataset Splits | Yes | Stanford Natural Language Inference (Bowman et al., 2015) (SNLI) dataset, which contains standard training/dev/test split of 549,367/9,842/9,824 samples.
Hardware Specification | Yes | All experimental codes are implemented in Python with Tensorflow and run on a single Nvidia GTX 1080Ti graphic card.
Software Dependencies | Yes | Both time cost and memory load data are collected under Tensorflow 1.3 with CUDA8 and cuDNN6021.
Experiment Setup | Yes | Training Setup: The optimization objective is the cross-entropy loss plus L2 regularization penalty. We minimize the objective by Adadelta (Zeiler, 2012) optimizer... The batch size is set to 64 for all methods. The training phase takes 50 epochs to converge. All weight matrices are initialized by Glorot Initialization... The Dropout (Srivastava et al., 2014) keep probability and the L2 regularization weight decay factor γ are set to 0.75 and 5×10^-5, respectively. The number of hidden units is 300. The unspecified activation functions in all models are set to ReLU (Glorot et al., 2011).
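
The "Experiment Setup" row above lists enough hyperparameters to outline the training configuration. Below is a minimal, hypothetical sketch of that setup in modern TensorFlow/Keras (the paper itself used TensorFlow 1.3); the toy classifier merely stands in for the actual Bi-BloSAN encoder, and only the hyperparameter values (Adadelta, batch size 64, 50 epochs, Glorot initialization, dropout keep probability 0.75, L2 factor 5×10^-5, 300 hidden units, ReLU) come from the paper.

    import tensorflow as tf

    GAMMA = 5e-5           # L2 weight-decay factor gamma (from the paper)
    HIDDEN_UNITS = 300     # number of hidden units (from the paper)
    DROPOUT_KEEP = 0.75    # dropout keep probability; Keras rate = 1 - keep

    # Toy stand-in classifier; the real model is the Bi-BloSAN sentence encoder.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(
            HIDDEN_UNITS,
            activation="relu",                                   # ReLU activations
            kernel_initializer="glorot_uniform",                 # Glorot initialization
            kernel_regularizer=tf.keras.regularizers.l2(GAMMA),  # L2 penalty
            input_shape=(300,)),                                 # hypothetical input feature size
        tf.keras.layers.Dropout(1.0 - DROPOUT_KEEP),
        tf.keras.layers.Dense(3),                                # e.g. 3 SNLI classes
    ])

    # Cross-entropy objective minimized with Adadelta (Zeiler, 2012).
    model.compile(
        optimizer=tf.keras.optimizers.Adadelta(),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

    # model.fit(x_train, y_train, batch_size=64, epochs=50)  # batch size and epochs per the paper

The sketch only illustrates how the reported hyperparameters map onto a training loop; reproducing the paper's numbers would require the authors' released Bi-BloSAN code linked in the "Open Source Code" row.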