Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling
Authors: Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Chengqi Zhang
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct the experiments of Bi-BloSAN and several popular RNN/CNN/SAN-based sequence encoding models on nine benchmark datasets for multiple different NLP tasks. A thorough comparison on nine benchmark datasets demonstrates the advantages of Bi-BloSAN in terms of training speed, inference accuracy and memory consumption. Figure 1 shows that Bi-BloSAN obtains the best accuracy by costing similar training time to DiSAN, and as little memory as Bi-LSTM, Bi-GRU and multi-head attention. |
| Researcher Affiliation | Academia | Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang & Chengqi Zhang. Centre for Artificial Intelligence, School of Software, University of Technology Sydney; Paul G. Allen School of Computer Science & Engineering, University of Washington |
| Pseudocode | No | The paper describes the proposed model and its components using textual descriptions and diagrams, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code and scripts for experiments are at https://github.com/taoshen58/BiBloSA |
| Open Datasets | Yes | Stanford Natural Language Inference (Bowman et al., 2015) (SNLI) dataset, which contains standard training/dev/test split of 549,367/9,842/9,824 samples. |
| Dataset Splits | Yes | Stanford Natural Language Inference (Bowman et al., 2015) (SNLI) dataset, which contains standard training/dev/test split of 549,367/9,842/9,824 samples. (A loading sketch for these splits is given below the table.) |
| Hardware Specification | Yes | All experimental codes are implemented in Python with TensorFlow and run on a single Nvidia GTX 1080Ti graphics card. |
| Software Dependencies | Yes | Both time cost and memory load data are collected under TensorFlow 1.3 with CUDA 8 and cuDNN 6021. |
| Experiment Setup | Yes | Training Setup: The optimization objective is the cross-entropy loss plus L2 regularization penalty. We minimize the objective by Adadelta (Zeiler, 2012) optimizer... The batch size is set to 64 for all methods. The training phase takes 50 epochs to converge. All weight matrices are initialized by Glorot Initialization... The Dropout (Srivastava et al., 2014) keep probability and the L2 regularization weight decay factor γ are set to 0.75 and 5×10⁻⁵, respectively. The number of hidden units is 300. The unspecified activation functions in all models are set to ReLU (Glorot et al., 2011). (A configuration sketch is given below the table.) |
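
The split sizes quoted in the Dataset Splits row can be checked with a minimal sketch that is not part of the paper's code, assuming the Hugging Face `datasets` package is available; the standard 549,367/9,842/9,824 counts correspond to the corpus after pairs without a gold label are dropped.

```python
# Minimal sketch (not from the paper) for checking the standard SNLI split sizes,
# assuming the Hugging Face `datasets` package is installed.
from datasets import load_dataset

snli = load_dataset("snli")

# Pairs without a gold label are marked with label == -1; filtering them out
# yields the 549,367/9,842/9,824 training/dev/test split quoted above.
snli = snli.filter(lambda ex: ex["label"] != -1)

for split, expected in [("train", 549_367), ("validation", 9_842), ("test", 9_824)]:
    print(f"{split}: {len(snli[split])} examples (expected {expected})")
```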
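
The hyperparameters quoted in the Experiment Setup row can be collected into a minimal tf.keras sketch. This is not the authors' TensorFlow 1.3 code: the Bi-BloSAN encoder is replaced by a placeholder Dense stack, the input shape is hypothetical, and the Adadelta learning rate (not quoted above) is left at the Keras default. Only the reported settings (cross-entropy plus L2 penalty, batch size 64, 50 epochs, Glorot initialization, keep probability 0.75, 300 hidden units, ReLU) are reproduced.

```python
# Minimal sketch (not the authors' code) of the training configuration quoted in
# the "Experiment Setup" row, written against the modern tf.keras API rather than
# the TensorFlow 1.3 graph API used in the paper.
import tensorflow as tf

HIDDEN_UNITS = 300          # "The number of hidden units is 300."
L2_DECAY = 5e-5             # L2 regularization weight decay factor gamma
DROPOUT_RATE = 1 - 0.75     # paper reports a dropout *keep* probability of 0.75
BATCH_SIZE = 64
EPOCHS = 50
NUM_CLASSES = 3             # SNLI: entailment / neutral / contradiction

regularizer = tf.keras.regularizers.l2(L2_DECAY)
initializer = tf.keras.initializers.GlorotUniform()   # Glorot initialization

# Placeholder encoder standing in for Bi-BloSAN; the (time, feature) input shape
# is an assumption for illustration only.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, HIDDEN_UNITS)),
    tf.keras.layers.Dense(HIDDEN_UNITS, activation="relu",
                          kernel_initializer=initializer,
                          kernel_regularizer=regularizer),
    tf.keras.layers.Dropout(DROPOUT_RATE),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(NUM_CLASSES),
])

# Cross-entropy loss plus the L2 penalty (contributed by the kernel regularizers),
# minimized with Adadelta as reported; the learning rate is not quoted, so the
# Keras default is kept here.
model.compile(
    optimizer=tf.keras.optimizers.Adadelta(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# model.fit(train_ds.batch(BATCH_SIZE), epochs=EPOCHS,
#           validation_data=dev_ds.batch(BATCH_SIZE))
```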