Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling

Authors: Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Chengqi Zhang

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct the experiments of Bi-BloSAN and several popular RNN/CNN/SAN-based sequence encoding models on nine benchmark datasets for multiple different NLP tasks. A thorough comparison on nine benchmark datasets demonstrates the advantages of Bi-BloSAN in terms of training speed, inference accuracy and memory consumption. Figure 1 shows that Bi-BloSAN obtains the best accuracy by costing similar training time to DiSAN, and as little memory as Bi-LSTM, Bi-GRU and multi-head attention.
Researcher Affiliation | Academia | Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang & Chengqi Zhang; Centre for Artificial Intelligence, School of Software, University of Technology Sydney; Paul G. Allen School of Computer Science & Engineering, University of Washington
Pseudocode | No | The paper describes the proposed model and its components using textual descriptions and diagrams, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Source code and scripts for experiments are at https://github.com/taoshen58/BiBloSA
Open Datasets | Yes | Stanford Natural Language Inference (Bowman et al., 2015) (SNLI) dataset, which contains standard training/dev/test split of 549,367/9,842/9,824 samples.
Dataset Splits | Yes | Stanford Natural Language Inference (Bowman et al., 2015) (SNLI) dataset, which contains standard training/dev/test split of 549,367/9,842/9,824 samples.
Hardware Specification | Yes | All experimental codes are implemented in Python with Tensorflow and run on a single Nvidia GTX 1080Ti graphic card.
Software Dependencies | Yes | Both time cost and memory load data are collected under Tensorflow 1.3 with CUDA8 and cuDNN6021.
Experiment Setup | Yes | Training Setup: The optimization objective is the cross-entropy loss plus L2 regularization penalty. We minimize the objective by Adadelta (Zeiler, 2012) optimizer... The batch size is set to 64 for all methods. The training phase takes 50 epochs to converge. All weight matrices are initialized by Glorot Initialization... The Dropout (Srivastava et al., 2014) keep probability and the L2 regularization weight decay factor γ are set to 0.75 and 5×10^-5, respectively. The number of hidden units is 300. The unspecified activation functions in all models are set to ReLU (Glorot et al., 2011).
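
The "Experiment Setup" row above lists enough hyperparameters to outline the training configuration. Below is a minimal, hypothetical sketch of that setup in modern TensorFlow/Keras (the paper itself used TensorFlow 1.3); the toy classifier merely stands in for the actual Bi-BloSAN encoder, and only the hyperparameter values (Adadelta, batch size 64, 50 epochs, Glorot initialization, dropout keep probability 0.75, L2 factor 5×10^-5, 300 hidden units, ReLU) come from the paper.

    import tensorflow as tf

    GAMMA = 5e-5           # L2 weight-decay factor gamma (from the paper)
    HIDDEN_UNITS = 300     # number of hidden units (from the paper)
    DROPOUT_KEEP = 0.75    # dropout keep probability; Keras rate = 1 - keep

    # Toy stand-in classifier; the real model is the Bi-BloSAN sentence encoder.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(
            HIDDEN_UNITS,
            activation="relu",                                   # ReLU activations
            kernel_initializer="glorot_uniform",                 # Glorot initialization
            kernel_regularizer=tf.keras.regularizers.l2(GAMMA),  # L2 penalty
            input_shape=(300,)),                                 # hypothetical input feature size
        tf.keras.layers.Dropout(1.0 - DROPOUT_KEEP),
        tf.keras.layers.Dense(3),                                # e.g. 3 SNLI classes
    ])

    # Cross-entropy objective minimized with Adadelta (Zeiler, 2012).
    model.compile(
        optimizer=tf.keras.optimizers.Adadelta(),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

    # model.fit(x_train, y_train, batch_size=64, epochs=50)  # batch size and epochs per the paper

The sketch only illustrates how the reported hyperparameters map onto a training loop; reproducing the paper's numbers would require the authors' released Bi-BloSAN code linked in the "Open Source Code" row.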