DAPE: Data-Adaptive Positional Encoding for Length Extrapolation
Authors: Chuanyang Zheng, Yihang Gao, Han Shi, Minbin Huang, Jingyao Li, Jing Xiong, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, Yu Li
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental validation on real-world datasets (Arxiv, Books3, and CHE) demonstrates that DAPE enhances model performances in terms of trained length and length generalization, where the improvements are statistically significant. |
| Researcher Affiliation | Collaboration | Chuanyang Zheng (1), Yihang Gao (2), Han Shi (3), Minbin Huang (1), Jingyao Li (1), Jing Xiong (4), Xiaozhe Ren (3), Michael Ng (5), Xin Jiang (3), Zhenguo Li (3), Yu Li (1); (1) CUHK, (2) NUS, (3) Noah's Ark Lab, (4) HKU, (5) HKBU |
| Pseudocode | No | Appendix J provides a full PyTorch implementation code block, not pseudocode. |
| Open Source Code | Yes | We have made our code publicly available to other researchers in the field. This initiative aims to facilitate a standardized comparison and evaluation of their respective methods, thereby advancing the collective understanding of model performance in relation to perplexity calculations. In this section, we present the implementation of the proposed DAPE module in PyTorch [49]. (Appendix J) [...] Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We have provided the pytorch implementation code in Appendix Section J. (An illustrative sketch of such a module is given below the table.) |
| Open Datasets | Yes | Our analysis involves training language models on the Arxiv and Books3 datasets, which are frequently used benchmarks for evaluating model performance [52, 13, 41, 24]. |
| Dataset Splits | Yes | Our analysis involves training language models on the Arxiv and Books3 datasets, which are frequently used benchmarks for evaluating model performance [52, 13, 41, 24]. |
| Hardware Specification | Yes | All experiments are conducted on 8 x A800 GPUs. |
| Software Dependencies | No | No specific version numbers for software dependencies (e.g., 'PyTorch 1.9') were found. Appendix J mentions 'PyTorch [49]' but without a version. |
| Experiment Setup | Yes | Table 3: Model Configurations (two model sizes). Training sequence length: 512 / 512; Batch size: 32 × 8 / 32 × 8; Number of iterations: 50k / 50k; Dropout prob.: 0.0 / 0.0; Attention dropout prob.: 0.0 / 0.0; Attention heads: 12 / 16; Feature dimension: 768 / 1024; Layer number: 12 / 24; Optimizer: Adam / Adam; Optimizer parameter betas: [0.9, 0.95] / [0.9, 0.95]; Learning rate: 6e-4 / 3e-4; Precision: float16 / float16 |
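
The paper's own Appendix J code is not reproduced on this page. Purely for illustration, below is a minimal PyTorch sketch of a DAPE-style data-adaptive bias: a small MLP applied over the head dimension to the concatenation of the attention logits and a static (ALiBi-style) bias, with shapes borrowed from the smaller configuration in Table 3 (12 heads, 768-dim features, 512-token sequences). The `alibi_bias` helper, the MLP width, and all names here are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of a DAPE-style data-adaptive attention bias (PyTorch).
# Assumption: the adaptive bias is an MLP over the head dimension applied to
# the concatenation of the raw attention logits and a static ALiBi-style bias.
# Hyperparameters are illustrative, taken from Table 3's smaller configuration.
import math
import torch
import torch.nn as nn


def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Static ALiBi-style bias of shape (num_heads, seq_len, seq_len)."""
    slopes = torch.tensor([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
    rel = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]  # j - i
    return slopes[:, None, None] * rel.clamp(max=0).float()[None, :, :]


class DAPEBias(nn.Module):
    """Maps (attention logits, static bias) to a data-adaptive bias, per head."""

    def __init__(self, num_heads: int = 12, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_heads, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_heads),
        )

    def forward(self, logits: torch.Tensor, static_bias: torch.Tensor) -> torch.Tensor:
        # logits: (batch, heads, q_len, k_len); static_bias: (heads, q_len, k_len)
        b = static_bias.unsqueeze(0).expand_as(logits)
        x = torch.cat([logits, b], dim=1)            # (batch, 2*heads, q, k)
        x = x.permute(0, 2, 3, 1)                    # move head dim last for the MLP
        adaptive = self.mlp(x).permute(0, 3, 1, 2)   # back to (batch, heads, q, k)
        return logits + b + adaptive                 # biased logits before softmax


if __name__ == "__main__":
    batch, heads, q_len, dim = 2, 12, 512, 768
    logits = torch.randn(batch, heads, q_len, q_len) / math.sqrt(dim // heads)
    bias = alibi_bias(heads, q_len)
    out = DAPEBias(num_heads=heads)(logits, bias)
    print(out.shape)  # torch.Size([2, 12, 512, 512])
```

Running the snippet prints the biased-logit shape; in a full model these logits would go through the usual causal mask and softmax.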