Evolving Attention with Residual Convolutions
Authors: Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments have demonstrated consistent improvement in various natural language and computer vision tasks. |
| Researcher Affiliation | Collaboration | 1 Peking University, 2 Microsoft Research, 3 Institute of Information Engineering, Chinese Academy of Sciences, 4 ETH Zurich, 5 Tsinghua University. |
| Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | The code is available at https://github.com/pkuyym/EvolvingAttention |
| Open Datasets | Yes | We choose GLUE benchmark (Wang et al., 2018) for an empirical study. |
| Dataset Splits | Yes | We leverage 10% training data to choose the hyper-parameters and perform evaluation on the development set. |
| Hardware Specification | Yes | All models are trained by 1.28 million training images for 100 epochs on 8 TESLA V100 GPUs. |
| Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify version numbers for any software libraries, frameworks (e.g., TensorFlow, PyTorch), or programming languages used. |
| Experiment Setup | Yes | Major hyper-parameters are as follows: optimizer is SGD with momentum 0.9, batch size is 32 per worker, weight decay is 1e-4. For the first 5 epochs, the learning rate is scaled linearly from 0 to 0.128, and then it is divided by 10 at epoch 30, 60, 80 and 90. |
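
The ImageNet training recipe quoted in the "Experiment Setup" row (SGD with momentum 0.9, weight decay 1e-4, a 5-epoch linear warm-up to a peak learning rate of 0.128, and step decays at epochs 30, 60, 80, and 90) can be expressed as a short schedule. The sketch below assumes PyTorch, which the paper does not specify, and uses placeholder `model` and training-loop code; it is an illustration of the reported hyper-parameters, not the authors' released implementation.

```python
# Minimal sketch of the reported ImageNet schedule, assuming PyTorch
# (the paper does not name its framework). `model` and the inner
# training loop are placeholders.
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the actual network

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.128,          # peak learning rate reached after warm-up
    momentum=0.9,
    weight_decay=1e-4,
)

def lr_factor(epoch: int) -> float:
    """Linear warm-up over the first 5 epochs, then divide by 10
    at epochs 30, 60, 80, and 90."""
    if epoch < 5:
        return (epoch + 1) / 5
    drops = sum(epoch >= boundary for boundary in (30, 60, 80, 90))
    return 0.1 ** drops

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(100):          # 100 epochs, as reported in the paper
    # ... per-batch forward/backward passes would go here ...
    optimizer.step()              # placeholder for the actual update loop
    scheduler.step()              # advance the schedule once per epoch
```

With 8 workers at a batch size of 32 per worker, this corresponds to an effective global batch size of 256 per optimizer step.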