Encoding Recurrence into Transformers
Authors: Feiqing Huang, Kexin Lu, Yuxi Cai, Zhen Qin, Yanwen Fang, Guangjian Tian, Guodong Li
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section contains four sequential modeling tasks and, for each task, we modify some popular Transformer baselines by adding the REMs to their attention weights via a gated mechanism as in (4). (A hedged code sketch of this gated mechanism is given after the table.) |
| Researcher Affiliation | Collaboration | Feiqing Huang¹, Kexin Lu¹, Yuxi Cai¹, Zhen Qin², Yanwen Fang¹, Guangjian Tian², Guodong Li¹. ¹Department of Statistics and Actuarial Science, The University of Hong Kong; ²Huawei Noah's Ark Lab |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | For RSA-CodeT5-small, our code is based on the original CodeT5, which can be referred to in the repository https://github.com/salesforce/CodeT5. |
| Open Datasets | Yes | Our experiments are performed on two public benchmark datasets: the ETT dataset is comprised of seven features related to the electric power long-term deployment, where {ETTh1, ETTh2} are recorded by the hour and ETTm1 is recorded at 15-minute intervals; and the Weather dataset contains twelve climate indicators collected every 1 hour over a 4-year period. The ETT dataset is accessible at https://github.com/zhouhaoyi/ETDataset. The six types of regular language datasets are obtained from the GitHub repository Transformer-Formal-Languages. |
| Dataset Splits | Yes | All hyperparameters in baseline models are set to the optimal setting in Zhou et al. (2021), and we also follow their train/val/test division and training schemes to conduct our experiments. The train/val/test split is 12/4/4 months. |
| Hardware Specification | Yes | All experiments are conducted on Nvidia V100 32GB GPUs. |
| Software Dependencies | No | The paper mentions using specific hardware (Nvidia V100 32GB GPUs) but does not provide specific version numbers for software dependencies such as programming languages or libraries used for implementation. |
| Experiment Setup | Yes | All hyperparameters in baseline models are set to the optimal setting in Zhou et al. (2021), and we also follow their train/val/test division and training schemes to conduct our experiments. The gate-control parameter µ is initialized in the interval [-3, 3] for all the layers. For (λ, γ, θ), which determine the recurrent patterns, we initialize the λ's at different heads to spread out over [-2, -1] ∪ [1, 2] and the ν's to spread out over [1, 2], and θ is initialized at π/4, to encourage REMs to be non-zero and well-diversified. In Appendix E.3, a 4-layer Transformer with 8 attention heads is trained using a batch size of 16 and the Adam optimizer with a learning rate of 0.0001, which is halved every 5 epochs. (Hedged sketches of this initialization and of the Appendix E.3 training configuration follow the table.) |
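
The gated mechanism quoted in the Research Type row can be illustrated as follows. This is a minimal single-head sketch, assuming a precomputed recurrence-encoding matrix `rem`, a scalar gate parameter `mu`, and a convex mix of the softmax attention weights with the REM as suggested by the paper's equation (4); it is not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def gated_rsa_attention(q, k, v, rem, mu):
    """Mix softmax self-attention weights with a precomputed REM via a sigmoid gate.

    q, k, v: (batch, seq_len, d) projections; rem: (seq_len, seq_len); mu: scalar tensor.
    """
    d = q.size(-1)
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # standard attention weights
    gate = torch.sigmoid(mu)                                      # gate value in (0, 1)
    mixed = (1.0 - gate) * attn + gate * rem                      # convex combination of attention and REM
    return mixed @ v
```

Because the gate is learned, each head can decide how much of its weight to devote to the recurrent pattern versus content-based attention.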
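The initialization quoted in the Experiment Setup row could be sketched as below. The per-head spreading via `torch.linspace`, the tensor shapes, and the variable names are illustrative assumptions, not the authors' exact code.

```python
import math
import torch

def init_rem_parameters(n_heads: int):
    """Illustrative initialization following the quoted scheme."""
    mu = torch.empty(1).uniform_(-3.0, 3.0)        # gate-control parameter drawn from [-3, 3]
    half = n_heads // 2
    lam = torch.cat([                              # spread lambda's over [-2, -1] and [1, 2]
        torch.linspace(-2.0, -1.0, half),
        torch.linspace(1.0, 2.0, n_heads - half),
    ])
    nu = torch.linspace(1.0, 2.0, n_heads)         # spread nu's over [1, 2]
    theta = torch.full((n_heads,), math.pi / 4)    # theta initialized at pi/4
    return mu, lam, nu, theta
```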
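The Appendix E.3 training configuration (4-layer, 8-head Transformer, batch size 16, Adam at learning rate 0.0001 halved every 5 epochs) could be reproduced roughly as follows. The stand-in encoder, `d_model=64`, and the 20-epoch loop are placeholders assumed for illustration.

```python
import torch

# Stand-in 4-layer, 8-head Transformer encoder; d_model=64 and the epoch count
# are illustrative assumptions, not values taken from the paper.
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=64, nhead=8), num_layers=4
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate every 5 epochs, as quoted from Appendix E.3.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

for epoch in range(20):
    # ... one training pass over batches of size 16 would go here ...
    scheduler.step()
```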