SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection

Authors: Xiaoya Li, Yuxian Meng, Mingxin Zhou, Qinghong Han, Fei Wu, Jiwei Li

Venue: NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments on neural machine translation, language modeling, graph representation learning and image classification, we demonstrate SAC is competitive with state-of-the-art models while significantly reducing memory cost.
Researcher Affiliation | Collaboration | Computer Science Department, Zhejiang University; Shannon.AI; {xiaoya_li,yuxian_meng,mingxin_zhou,qinghong_han,jiwei_li}@shannonai.com
Pseudocode | No | The paper describes the Edge Predictor and its steps in detail in the text and in a figure, but it does not include a formal pseudocode block or algorithm box.
Open Source Code | No | The paper does not contain any explicit statement about releasing open-source code or provide a link to a code repository for the described methodology.
Open Datasets | Yes | Following Vaswani et al. (2017); Ott et al. (2018); Kitaev et al. (2020), we used the standard WMT 2014 English-German dataset to test the proposed model. The dataset consists of about 4.5 million sentence pairs. Sentences are encoded using BPE (Sennrich et al., 2016), which has a shared source-target vocabulary of about 37000 tokens. (A hedged preprocessing sketch is given after the table.)
Dataset Splits | Yes | Table 1: BLEU scores on the newstest2013 for development and newstest2014 for test for WMT English-German. We train all models with Adam (Kingma and Ba, 2014) and early stopping on the validation set.
Hardware Specification | Yes | Models are run on 8 NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions optimizers (Adam), tools (BPE, Stanford Dependency parser) and model architectures, but does not specify software dependencies with version numbers (e.g., 'Python 3.x', 'PyTorch 1.x', 'CUDA 11.x').
Experiment Setup | Yes | For fair comparison, we used the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.98 and ϵ = 10^-9 for all models. Label smoothing (Szegedy et al., 2016) with ϵ = 0.1 is applied for all models. For the base setup, following Vaswani et al. (2017), the dimensionality of inputs and outputs d_model is set to 512, and the inner-layer dimensionality d_ff is set to 2,048.
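
The Open Datasets row notes that sentences are byte-pair encoded with a shared source-target vocabulary of about 37000 tokens. Below is a minimal sketch of that preprocessing step; it uses the sentencepiece library as a stand-in for the subword-nmt BPE tool of Sennrich et al. (2016), and all file names are hypothetical rather than taken from the paper.

```python
# Hedged preprocessing sketch (not the authors' pipeline): learn a joint BPE
# model on the concatenated English and German training text so both sides
# share one ~37k-token vocabulary, then segment a sentence with it.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.en-de.concat.txt",   # hypothetical file: En + De training text concatenated
    model_prefix="wmt14_ende_bpe",    # hypothetical output prefix
    vocab_size=37000,                 # shared vocabulary size reported in the quote
    model_type="bpe",
    character_coverage=1.0,           # keep all characters (e.g. German umlauts)
)

sp = spm.SentencePieceProcessor(model_file="wmt14_ende_bpe.model")
print(sp.encode("Resumption of the session", out_type=str))
```

Training a single model on the concatenated bitext is what makes the vocabulary "shared" between source and target, which is the property the quoted excerpt describes.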
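The Experiment Setup row quotes concrete optimizer and architecture hyperparameters. The snippet below is a hedged PyTorch sketch of that configuration only; it uses a stock nn.Transformer as a placeholder for the paper's SAC model, and the learning rate, number of heads, and layer counts are assumptions that the quoted excerpt does not specify.

```python
# Hedged configuration sketch, assuming PyTorch >= 1.10 (for label_smoothing).
# Values taken from the quoted setup: betas=(0.9, 0.98), eps=1e-9,
# label smoothing 0.1, d_model=512, d_ff=2048. Everything else is assumed.
import torch
import torch.nn as nn

VOCAB_SIZE = 37000         # shared BPE vocabulary from the dataset description
D_MODEL, D_FF = 512, 2048  # base-setup dimensions quoted above

model = nn.Transformer(
    d_model=D_MODEL,
    nhead=8,                # assumed (Transformer-base default)
    num_encoder_layers=6,   # assumed (Transformer-base default)
    num_decoder_layers=6,   # assumed (Transformer-base default)
    dim_feedforward=D_FF,
)
generator = nn.Linear(D_MODEL, VOCAB_SIZE)  # output projection to the vocabulary

optimizer = torch.optim.Adam(
    list(model.parameters()) + list(generator.parameters()),
    lr=7e-4,               # assumed; the quote does not give the learning-rate schedule
    betas=(0.9, 0.98),     # beta1, beta2 from the quoted setup
    eps=1e-9,              # epsilon from the quoted setup
)

# Label smoothing with epsilon = 0.1 (Szegedy et al., 2016).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```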