Encoding Recurrence into Transformers

Authors: Feiqing Huang, Kexin Lu, Yuxi Cai, Zhen Qin, Yanwen Fang, Guangjian Tian, Guodong Li

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section contains four sequential modeling tasks, and for each task we modify some popular Transformer baselines by adding the REMs to their attention weights via a gated mechanism as in (4) (a sketch of this gate appears after the table).
Researcher Affiliation | Collaboration | Feiqing Huang1, Kexin Lu1, Yuxi Cai1, Zhen Qin2, Yanwen Fang1, Guangjian Tian2, Guodong Li1; 1 Department of Statistics and Actuarial Science, The University of Hong Kong; 2 Huawei Noah's Ark Lab
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | For RSA-CodeT5-small, our code is based on the original CodeT5, which can be referred to in the repository https://github.com/salesforce/CodeT5.
Open Datasets | Yes | Our experiments are performed on two public benchmark datasets: the ETT dataset is comprised of seven features related to the electric power long-term deployment, where {ETTh1, ETTh2} are recorded by the hour and ETTm1 is recorded at 15-minute intervals; and the Weather dataset contains twelve climate indicators collected every hour over a 4-year period. The ETT dataset is accessible at https://github.com/zhouhaoyi/ETDataset, and the six types of regular language datasets are obtained from the GitHub repository Transformer Formal-Language.
Dataset Splits | Yes | All hyperparameters in baseline models are set to the optimal setting in Zhou et al. (2021), and we also follow their train/val/test division and training schemes to conduct our experiments; the train/val/test split is 12/4/4 months (a split sketch appears after the table).
Hardware Specification | Yes | All experiments are conducted on Nvidia V100 32GB GPUs.
Software Dependencies | No | The paper mentions using specific hardware (Nvidia V100 32GB GPUs) but does not provide specific version numbers for software dependencies such as programming languages or libraries used for implementation.
Experiment Setup | Yes | All hyperparameters in baseline models are set to the optimal setting in Zhou et al. (2021), and we also follow their train/val/test division and training schemes to conduct our experiments. The gate-control parameter µ is initialized in the interval [−3, 3] for all the layers. For (λ, γ, θ), which determine the recurrent patterns, we initialize the λ's at different heads to spread out over [−2, −1] ∪ [1, 2] and the ν's to spread out over [1, 2], and θ is initialized at π/4, to encourage the REMs to be non-zero and well-diversified. In Appendix E.3, a 4-layer Transformer with 8 attention heads is trained using a batch size of 16 and the Adam optimizer with a learning rate of 0.0001, which is halved every 5 epochs (an initialization and training sketch appears after the table).
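
The gated mechanism quoted in the Research Type row can be illustrated with a minimal PyTorch sketch, assuming Eq. (4) takes a convex, per-head gated combination of the softmax attention weights and a precomputed recurrence-encoding matrix (REM); the names `gated_rsa_weights`, `rem`, and `mu` are illustrative and not taken from the released code.

```python
import torch
import torch.nn.functional as F

def gated_rsa_weights(q, k, rem, mu):
    """Mix softmax attention weights with a recurrence-encoding matrix (REM)
    through a per-head sigmoid gate (a sketch of the gated mechanism in (4)).

    q, k : (batch, heads, seq_len, d_head) query and key tensors
    rem  : (heads, seq_len, seq_len) precomputed REM for each head (assumed given)
    mu   : (heads,) learnable gate-control parameters
    """
    d = q.size(-1)
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # standard attention weights
    gate = torch.sigmoid(mu).view(1, -1, 1, 1)                    # one gate value per head
    return (1.0 - gate) * attn + gate * rem.unsqueeze(0)          # gated combination
```

The combined weights would then multiply the value matrix V, as in ordinary self-attention.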
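For the 12/4/4-month ETT split quoted in the Dataset Splits row, a minimal sketch could look like the following, assuming an hourly ETTh1 CSV with a `date` column and the 30-day-month convention used for the ETT benchmarks.

```python
import pandas as pd

# Sketch of the 12/4/4-month train/val/test split for the hourly ETTh1 file,
# assuming months of 30 days (30 * 24 hourly records per month).
df = pd.read_csv("ETTh1.csv", parse_dates=["date"])

hours_per_month = 30 * 24
train = df.iloc[: 12 * hours_per_month]                      # first 12 months
val = df.iloc[12 * hours_per_month: 16 * hours_per_month]    # next 4 months
test = df.iloc[16 * hours_per_month: 20 * hours_per_month]   # final 4 months
```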
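The initialization and training details quoted in the Experiment Setup row could be sketched as below. The parameter names (`mu`, `lam`, `nu`, `theta`), the per-head layout, and the model width of 512 are assumptions for illustration; only the intervals, π/4, the 4-layer/8-head architecture, batch size 16, Adam with learning rate 1e-4, and the halving every 5 epochs come from the quoted text.

```python
import math
import torch
from torch import nn, optim

heads = 8        # Appendix E.3: 4 layers, 8 attention heads
batch_size = 16  # Appendix E.3: batch size 16

# Initialization described above (names are illustrative, not from the released code).
mu = nn.Parameter(torch.empty(heads).uniform_(-3.0, 3.0))         # gate control in [-3, 3]
half = heads // 2
lam = nn.Parameter(torch.cat([torch.linspace(-2.0, -1.0, half),   # spread over [-2, -1] and [1, 2]
                              torch.linspace(1.0, 2.0, heads - half)]))
nu = nn.Parameter(torch.linspace(1.0, 2.0, heads))                # spread over [1, 2]
theta = nn.Parameter(torch.full((heads,), math.pi / 4))           # initialized at pi/4

# Stand-in 4-layer, 8-head Transformer (d_model=512 is an assumption).
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=heads, batch_first=True),
    num_layers=4,
)
optimizer = optim.Adam(list(model.parameters()) + [mu, lam, nu, theta], lr=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # halve lr every 5 epochs
```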