SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection
Authors: Xiaoya Li, Yuxian Meng, Mingxin Zhou, Qinghong Han, Fei Wu, Jiwei Li
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on neural machine translation, language modeling, graph representation learning and image classification, we demonstrate SAC is competitive with state-of-the-art models while significantly reducing memory cost. |
| Researcher Affiliation | Collaboration | Computer Science Department, Zhejiang University; Shannon.AI; {xiaoya_li,yuxian_meng,mingxin_zhou,qinghong_han,jiwei_li}@shannonai.com |
| Pseudocode | No | The paper describes the Edge Predictor and its steps in detail within the text and using a figure, but it does not include a formal pseudocode block or algorithm box. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing open-source code or provide a link to a code repository for the described methodology. |
| Open Datasets | Yes | Following Vaswani et al. (2017); Ott et al. (2018); Kitaev et al. (2020), we used the standard WMT 2014 English-German dataset to test the proposed model. The dataset consists of about 4.5 million sentence pairs. Sentences are encoded using BPE (Sennrich et al., 2016), which has a shared source-target vocabulary of about 37000 tokens. (A BPE preprocessing sketch follows the table.) |
| Dataset Splits | Yes | "Table 1: BLEU scores on the newstest2013 for development and newstest2014 for test for WMT English-German." and "We train all models with Adam (Kingma and Ba, 2014) and early stopping on the validation set." |
| Hardware Specification | Yes | Models are run on 8 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions optimizers (Adam) and tools (BPE, Stanford Dependency parser) and model architectures, but does not specify software dependencies with version numbers (e.g., 'Python 3.x', 'PyTorch 1.x', 'CUDA 11.x'). |
| Experiment Setup | Yes | For fair comparison, we used the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.98 and ϵ = 10⁻⁹ for all models. Label smoothing (Szegedy et al., 2016) with ϵ = 0.1 is applied for all models. For the base setup, following Vaswani et al. (2017), the dimensionality of inputs and outputs d_model is set to 512, and the inner-layer dimensionality d_ff is set to 2,048. (A configuration sketch follows the table.) |
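
The Open Datasets row quotes the WMT 2014 English-German preprocessing: BPE encoding with a shared source-target vocabulary of about 37,000 tokens. Below is a minimal sketch of that step, assuming the `subword-nmt` package of Sennrich et al. (2016); the file names are placeholders, and 37,000 is used directly as the number of merge operations since the quote only gives the resulting vocabulary size, not the merge count.

```python
# Rough sketch of the quoted WMT14 En-De preprocessing: learn one set of BPE
# merges on the concatenated source+target text (shared vocabulary), then
# apply it to each side. File names are placeholders; API follows subword-nmt
# and may differ slightly across versions.
import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

NUM_MERGES = 37000  # stand-in for the "shared source-target vocabulary of about 37000 tokens"

# Learn merges on the joint En+De training text so both languages share one vocabulary.
with codecs.open("train.en-de.joint.tok", encoding="utf-8") as infile, \
     codecs.open("codes.bpe", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, NUM_MERGES)

# Apply the learned merges to each side of the parallel corpus.
with codecs.open("codes.bpe", encoding="utf-8") as codes:
    bpe = BPE(codes)
for side in ("en", "de"):
    with codecs.open(f"train.{side}.tok", encoding="utf-8") as src, \
         codecs.open(f"train.{side}.bpe", "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(bpe.process_line(line))
```

The design point carried by the quote is that a single merge table is learned over both languages, so source and target sides segment into one shared subword vocabulary.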
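
The Experiment Setup row pins down the optimizer settings and the Transformer-base dimensions. A minimal PyTorch sketch of that configuration follows; the model construction and the choice of 8 attention heads are assumptions taken from the standard Transformer base setup of Vaswani et al. (2017) rather than from the quote, and the `label_smoothing` argument requires PyTorch 1.10 or later.

```python
import torch
import torch.nn as nn

# Transformer "base" dimensions from the quote: d_model = 512, d_ff = 2048.
# nhead = 8 is the standard base setting (Vaswani et al., 2017), not stated in the quote.
model = nn.Transformer(d_model=512, nhead=8, dim_feedforward=2048)

# Adam with beta1 = 0.9, beta2 = 0.98, eps = 1e-9, as quoted.
# The learning-rate schedule is not specified in the quote and is omitted here.
optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.98), eps=1e-9)

# Label smoothing with eps = 0.1 (Szegedy et al., 2016), as quoted.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```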