SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection

Authors: Xiaoya Li, Yuxian Meng, Mingxin Zhou, Qinghong Han, Fei Wu, Jiwei Li

Venue: NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments on neural machine translation, language modeling, graph representation learning and image classification, we demonstrate SAC is competitive with state-of-the-art models while significantly reducing memory cost.
Researcher Affiliation | Collaboration | Computer Science Department, Zhejiang University; Shannon.AI; {xiaoya_li,yuxian_meng,mingxin_zhou,qinghong_han,jiwei_li}@shannonai.com
Pseudocode | No | The paper describes the Edge Predictor and its steps in detail in the text and in a figure, but it does not include a formal pseudocode block or algorithm box.
Open Source Code | No | The paper does not contain any explicit statement about releasing open-source code or provide a link to a code repository for the described methodology.
Open Datasets | Yes | Following Vaswani et al. (2017); Ott et al. (2018); Kitaev et al. (2020), we used the standard WMT 2014 English-German dataset to test the proposed model. The dataset consists of about 4.5 million sentence pairs. Sentences are encoded using BPE (Sennrich et al., 2016), which has a shared source-target vocabulary of about 37000 tokens. (A hedged preprocessing sketch is given after the table.)
Dataset Splits | Yes | Table 1: BLEU scores on the newstest2013 for development and newstest2014 for test for WMT English-German. We train all models with Adam (Kingma and Ba, 2014) and early stopping on the validation set.
Hardware Specification | Yes | Models are run on 8 NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions optimizers (Adam), tools (BPE, Stanford Dependency parser) and model architectures, but does not specify software dependencies with version numbers (e.g., 'Python 3.x', 'PyTorch 1.x', 'CUDA 11.x').
Experiment Setup | Yes | For fair comparison, we used the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.98 and ϵ = 10^-9 for all models. Label smoothing (Szegedy et al., 2016) with ϵ = 0.1 is applied for all models. For the base setup, following Vaswani et al. (2017), the dimensionality of inputs and outputs d_model is set to 512, and the inner-layer dimensionality d_ff is set to 2,048.
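
The Open Datasets row notes that sentences are byte-pair encoded with a shared source-target vocabulary of about 37000 tokens. Below is a minimal sketch of that preprocessing step; it uses the sentencepiece library as a stand-in for the subword-nmt BPE tool of Sennrich et al. (2016), and all file names are hypothetical rather than taken from the paper.

```python
# Hedged preprocessing sketch (not the authors' pipeline): learn a joint BPE
# model on the concatenated English and German training text so both sides
# share one ~37k-token vocabulary, then segment a sentence with it.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.en-de.concat.txt",   # hypothetical file: En + De training text concatenated
    model_prefix="wmt14_ende_bpe",    # hypothetical output prefix
    vocab_size=37000,                 # shared vocabulary size reported in the quote
    model_type="bpe",
    character_coverage=1.0,           # keep all characters (e.g. German umlauts)
)

sp = spm.SentencePieceProcessor(model_file="wmt14_ende_bpe.model")
print(sp.encode("Resumption of the session", out_type=str))
```

Training a single model on the concatenated bitext is what makes the vocabulary "shared" between source and target, which is the property the quoted excerpt describes.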
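The Experiment Setup row quotes concrete optimizer and architecture hyperparameters. The snippet below is a hedged PyTorch sketch of that configuration only; it uses a stock nn.Transformer as a placeholder for the paper's SAC model, and the learning rate, number of heads, and layer counts are assumptions that the quoted excerpt does not specify.

```python
# Hedged configuration sketch, assuming PyTorch >= 1.10 (for label_smoothing).
# Values taken from the quoted setup: betas=(0.9, 0.98), eps=1e-9,
# label smoothing 0.1, d_model=512, d_ff=2048. Everything else is assumed.
import torch
import torch.nn as nn

VOCAB_SIZE = 37000         # shared BPE vocabulary from the dataset description
D_MODEL, D_FF = 512, 2048  # base-setup dimensions quoted above

model = nn.Transformer(
    d_model=D_MODEL,
    nhead=8,                # assumed (Transformer-base default)
    num_encoder_layers=6,   # assumed (Transformer-base default)
    num_decoder_layers=6,   # assumed (Transformer-base default)
    dim_feedforward=D_FF,
)
generator = nn.Linear(D_MODEL, VOCAB_SIZE)  # output projection to the vocabulary

optimizer = torch.optim.Adam(
    list(model.parameters()) + list(generator.parameters()),
    lr=7e-4,               # assumed; the quote does not give the learning-rate schedule
    betas=(0.9, 0.98),     # beta1, beta2 from the quoted setup
    eps=1e-9,              # epsilon from the quoted setup
)

# Label smoothing with epsilon = 0.1 (Szegedy et al., 2016).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```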