Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling
Authors: Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Chengqi Zhang
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct the experiments of Bi-BloSAN and several popular RNN/CNN/SAN-based sequence encoding models on nine benchmark datasets for multiple different NLP tasks. A thorough comparison on nine benchmark datasets demonstrates the advantages of Bi-BloSAN in terms of training speed, inference accuracy and memory consumption. Figure 1 shows that Bi-BloSAN obtains the best accuracy by costing similar training time to DiSAN, and as little memory as Bi-LSTM, Bi-GRU and multi-head attention. |
| Researcher Affiliation | Academia | Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang & Chengqi Zhang; Centre for Artificial Intelligence, School of Software, University of Technology Sydney; Paul G. Allen School of Computer Science & Engineering, University of Washington |
| Pseudocode | No | The paper describes the proposed model and its components using textual descriptions and diagrams, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code and scripts for experiments are at https://github.com/taoshen58/BiBloSA |
| Open Datasets | Yes | Stanford Natural Language Inference (Bowman et al., 2015) (SNLI) dataset, which contains a standard training/dev/test split of 549,367/9,842/9,824 samples. |
| Dataset Splits | Yes | Stanford Natural Language Inference (Bowman et al., 2015) (SNLI) dataset, which contains a standard training/dev/test split of 549,367/9,842/9,824 samples. |
| Hardware Specification | Yes | All experimental codes are implemented in Python with TensorFlow and run on a single Nvidia GTX 1080Ti graphics card. |
| Software Dependencies | Yes | Both time cost and memory load data are collected under TensorFlow 1.3 with CUDA 8 and cuDNN 6021. |
| Experiment Setup | Yes | Training Setup: The optimization objective is the cross-entropy loss plus L2 regularization penalty. We minimize the objective by Adadelta (Zeiler, 2012) optimizer... The batch size is set to 64 for all methods. The training phase takes 50 epochs to converge. All weight matrices are initialized by Glorot Initialization... The Dropout (Srivastava et al., 2014) keep probability and the L2 regularization weight decay factor γ are set to 0.75 and 5×10⁻⁵, respectively. The number of hidden units is 300. The unspecified activation functions in all models are set to ReLU (Glorot et al., 2011). |