Evolving Attention with Residual Convolutions

Authors: Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments have demonstrated consistent improvement in various natural language and computer vision tasks.
Researcher Affiliation | Collaboration | (1) Peking University; (2) Microsoft Research; (3) Institute of Information Engineering, Chinese Academy of Sciences; (4) ETH Zurich; (5) Tsinghua University.
Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | The code is available at https://github.com/pkuyym/Evolving Attention.
Open Datasets | Yes | We choose GLUE benchmark (Wang et al., 2018) for an empirical study.
Dataset Splits | Yes | We leverage 10% training data to choose the hyper-parameters and perform evaluation on the development set.
Hardware Specification | Yes | All models are trained by 1.28 million training images for 100 epochs on 8 TESLA V100 GPUs.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify version numbers for any software libraries, frameworks (e.g., TensorFlow, PyTorch), or programming languages used.
Experiment Setup | Yes | Major hyper-parameters are as follows: optimizer is SGD with momentum 0.9, batch size is 32 per worker, weight decay is 1e-4. For the first 5 epochs, the learning rate is scaled linearly from 0 to 0.128, and then it is divided by 10 at epoch 30, 60, 80 and 90.
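The dataset-split row above quotes a 10% hold-out of the training data for hyper-parameter selection. A minimal sketch of one way to realize such a split, assuming PyTorch and a placeholder dataset (the paper does not describe its data tooling or random seed):

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in for a GLUE training set; the real data loading is not described in the paper.
train_set = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))

# Hold out 10% of the training data for hyper-parameter selection,
# keeping the official development set for evaluation, as the quote states.
n_holdout = int(0.1 * len(train_set))
train_subset, hyperparam_subset = random_split(
    train_set,
    [len(train_set) - n_holdout, n_holdout],
    generator=torch.Generator().manual_seed(0),  # fixed seed is an assumption for reproducibility
)
```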
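The experiment-setup row quotes an SGD configuration with a warm-up and step schedule. The paper does not name its framework, so the following is a minimal PyTorch sketch of that schedule (linear warm-up to 0.128 over the first 5 epochs, then division by 10 at epochs 30, 60, 80 and 90); the placeholder model, the training loop, and the epoch-granularity interpretation of the warm-up are assumptions for illustration.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

# Hypothetical placeholder model; the paper's network is not reproduced here.
model = torch.nn.Linear(16, 1000)

# Quoted hyper-parameters: SGD, momentum 0.9, weight decay 1e-4, peak learning rate 0.128.
optimizer = SGD(model.parameters(), lr=0.128, momentum=0.9, weight_decay=1e-4)

def lr_factor(epoch: int) -> float:
    # Linear warm-up from 0 to the peak LR over the first 5 epochs
    # (per-epoch warm-up is an assumption; the paper may step per iteration).
    if epoch < 5:
        return epoch / 5.0
    # Divide the learning rate by 10 at epochs 30, 60, 80 and 90.
    factor = 1.0
    for milestone in (30, 60, 80, 90):
        if epoch >= milestone:
            factor *= 0.1
    return factor

scheduler = LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(100):  # 100 training epochs, as quoted above
    # ... one epoch of training with batch size 32 per worker goes here ...
    scheduler.step()
```

Printing `optimizer.param_groups[0]['lr']` once per epoch is an easy way to check that the resulting schedule matches the quoted milestones.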