Exploring Transformer Extrapolation
Authors: Zhen Qin, Yiran Zhong, Hui Deng
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted on the Wikitext-103, Books, Github, and Wiki Book datasets to demonstrate the viability of our discovered conditions. |
| Researcher Affiliation | Collaboration | ¹OpenNLPLab, Shanghai AI Lab, Shanghai, China; ²TapTap, Shanghai, China; ³Northwestern Polytechnical University, Shaanxi, China |
| Pseudocode | No | The paper presents mathematical proofs and equations (e.g., Eqs. 1-28) but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code is released at: https://github.com/OpenNLPLab/Rpe. |
| Open Datasets | Yes | We conduct experiments on Wikitext-103 (Merity et al. 2016), Books (Zhu et al. 2015), Github (Gao et al. 2020) and Wiki Book (Wettig et al. 2022). |
| Dataset Splits | No | The paper mentions 'max training length during training is 512' and evaluates on 'testing PPLs' by scaling inference length, but it does not provide specific train/validation/test splits or a clear description of a validation set's role and size. |
| Hardware Specification | Yes | All models are implemented in Fairseq (Ott et al. 2019) and trained on 8 V100 GPUs. |
| Software Dependencies | No | The paper states 'All models are implemented in Fairseq (Ott et al. 2019)' but does not provide specific version numbers for Fairseq or any other software dependencies used in the experiments. |
| Experiment Setup | Yes | We use the same model architecture and training configuration for all RPE variants to ensure fairness. For Wikitext-103 (Merity et al. 2016), since it is a relatively small dataset, we use a 6-layer transformer decoder structure with an embedding size of 512. For the other datasets, we use a 12-layer transformer decoder structure with an embedding size of 768. The evaluation metric is perplexity (PPL) and the max training length during training is 512. The detailed hyper-parameter settings are listed in the Appendix. |
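The Experiment Setup row above fixes only the backbone dimensions for the Wikitext-103 model: a 6-layer transformer decoder with embedding size 512, trained at a maximum length of 512 tokens. The following is a minimal PyTorch sketch of a decoder-only language model with those dimensions; the vocabulary size, head count, and feed-forward width are assumptions not stated in the excerpt, and the relative positional encoding variants the paper actually compares (and its Fairseq training configuration) are not reproduced here.

```python
import torch
import torch.nn as nn

# Backbone dimensions taken from the Experiment Setup row (Wikitext-103 model);
# vocabulary size and head count are assumptions, not values from the paper.
VOCAB_SIZE = 32_000   # placeholder vocabulary size (assumption)
EMBED_DIM = 512       # embedding size stated for Wikitext-103
NUM_LAYERS = 6        # decoder layers stated for Wikitext-103
NUM_HEADS = 8         # assumed head count
MAX_LEN = 512         # max training length stated in the setup


class DecoderLM(nn.Module):
    """Decoder-only language model with the stated dimensions.

    The RPE variants studied in the paper are omitted; this only
    illustrates the size of the backbone.
    """

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        layer = nn.TransformerEncoderLayer(
            d_model=EMBED_DIM, nhead=NUM_HEADS,
            dim_feedforward=4 * EMBED_DIM, batch_first=True)
        # An encoder stack run with a causal mask behaves as a decoder-only LM.
        self.blocks = nn.TransformerEncoder(layer, num_layers=NUM_LAYERS)
        self.lm_head = nn.Linear(EMBED_DIM, VOCAB_SIZE)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        # Upper-triangular -inf mask enforces causal (left-to-right) attention.
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        x = self.embed(tokens)
        x = self.blocks(x, mask=causal_mask)
        return self.lm_head(x)


model = DecoderLM()
logits = model(torch.randint(0, VOCAB_SIZE, (2, MAX_LEN)))  # (batch, 512, vocab)
```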
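The Dataset Splits row notes that extrapolation is measured by scaling the inference length at test time while training uses a maximum length of 512. A simple way to run that kind of check is chunked perplexity evaluation at several lengths. The sketch below assumes a `model` like the one above and a 1-D tensor of test token ids; it is a simplified illustration, not the paper's exact evaluation protocol.

```python
import math
import torch
import torch.nn.functional as F


@torch.no_grad()
def perplexity_at_length(model, token_ids, eval_len):
    """Compute PPL over non-overlapping chunks of `eval_len` tokens.

    `model` maps (batch, seq) token ids to (batch, seq, vocab) logits;
    `token_ids` is a 1-D LongTensor holding the test corpus.
    """
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for start in range(0, token_ids.numel() - 1, eval_len):
        chunk = token_ids[start:start + eval_len + 1]
        if chunk.numel() < 2:
            break
        inputs = chunk[:-1].unsqueeze(0)
        targets = chunk[1:].unsqueeze(0)
        logits = model(inputs)
        nll = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            reduction="sum")
        total_nll += nll.item()
        total_tokens += targets.numel()
    return math.exp(total_nll / total_tokens)


# Sweep inference lengths beyond the 512-token training length to probe
# extrapolation (the specific lengths here are assumptions):
# for length in (512, 1024, 2048, 4096):
#     print(length, perplexity_at_length(model, test_tokens, length))
```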