Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure
Authors: Hanseul Cho, Jaeyoung Cha, Pranjal Awasthi, Srinadh Bhojanapalli, Anupam Gupta, Chulhee Yun
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the empirical side, we show that with the proposed position coupling, a small (1-layer) Transformer trained on 1 to 30-digit additions can generalize up to 200-digit additions (6.67× the trained length). On the theoretical side, we prove that a 1-layer Transformer with coupled positions can solve the addition task involving exponentially many digits, whereas any 1-layer Transformer without positional information cannot entirely solve it. |
| Researcher Affiliation | Collaboration | Hanseul Cho (Graduate School of AI, KAIST, jhs4015@kaist.ac.kr); Jaeyoung Cha (Graduate School of AI, KAIST, chajaeyoung@kaist.ac.kr); Pranjal Awasthi (Google Research, pranjalawasthi@google.com); Srinadh Bhojanapalli (Google Research, bsrinadh@google.com); Anupam Gupta (NYU & Google Research, anupam.g@nyu.edu); Chulhee Yun (Graduate School of AI, KAIST, chulhee.yun@kaist.ac.kr) |
| Pseudocode | No | No pseudocode or algorithm block is explicitly labeled or presented in a structured format. |
| Open Source Code | Yes | Our codebase is available at github.com/HanseulJo/position-coupling. |
| Open Datasets | No | Data Sampling. We opt for the balanced sampling in terms of the number of digits (Nogueira et al., 2021). Given the maximum number of digits D_max, we do balanced sampling for each operand in two steps. First, we sample the number of digits D ∈ [1, D_max] uniformly at random. Next, we sample an operand from [10^{D-1}, 10^D − 1] uniformly at random, except for D = 1 where we sample from [0, 9]. This procedure addresses the imbalance problem in the number of digits of operands. (A minimal sampling sketch is given after the table.) |
| Dataset Splits | Yes | For each run of training, we choose and evaluate the best model in terms of the validation loss for 200-digit additions. |
| Hardware Specification | Yes | Device: NVIDIA RTX A6000 (48GB) |
| Software Dependencies | Yes | We additionally implement a custom RMSNorm module (Zhang and Sennrich, 2019) and various positioning schemes of normalization layers (e.g., Pre-Norm (Xiong et al., 2020), Post-Norm (Vaswani et al., 2017), and their combination), to follow the implementation details of Zhou et al. (2024b). ... established on top of PyTorch (Paszke et al., 2019) and Huggingface. (A minimal RMSNorm sketch appears after the table.) |
| Experiment Setup | Yes | We summarize all hyperparameters in Appendix C. Table 1: Hyperparameter summary for the decimal integer addition task: comparison between trained lengths (Figures 3 and 14). ... Training Steps: 50,000; Batch Size: 1,000; Optimizer: Adam (Kingma and Ba, 2015); Learning Rate (LR): 0.0001; LR Warm-up: Linear (from 0 to LR), 1% of total steps; LR Cool-down: Cosine decay (from LR to 0.1×LR); Maximum Position ID (max_pos): 202. (A training-schedule sketch follows the table.) |
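
To make the balanced data-sampling procedure quoted under "Open Datasets" concrete, here is a minimal Python sketch of the two-step recipe (draw the digit count first, then the operand). The function names `sample_operand` and `sample_addition_example` and the `a+b=c` string format are illustrative assumptions, not the authors' code.

```python
import random

def sample_operand(max_digits: int) -> int:
    """Balanced sampling of one operand: first draw the digit count D
    uniformly from [1, max_digits], then draw the operand uniformly
    from [10^(D-1), 10^D - 1] (from [0, 9] when D == 1)."""
    d = random.randint(1, max_digits)
    if d == 1:
        return random.randint(0, 9)
    return random.randint(10 ** (d - 1), 10 ** d - 1)

def sample_addition_example(max_digits: int = 30) -> str:
    """One training example for the addition task with two balanced operands."""
    a, b = sample_operand(max_digits), sample_operand(max_digits)
    return f"{a}+{b}={a + b}"

if __name__ == "__main__":
    random.seed(0)
    print([sample_addition_example(30) for _ in range(3)])
```

Because D is drawn uniformly before the operand, long and short operands appear equally often, which is the digit-count imbalance the quoted passage is addressing.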
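The custom RMSNorm module mentioned under "Software Dependencies" refers to the standard formulation of Zhang and Sennrich (2019): rescale each feature vector by its root mean square, apply a learnable gain, and skip the mean subtraction and bias of LayerNorm. The PyTorch sketch below is that textbook formulation, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization (Zhang and Sennrich, 2019)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain, no bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS of the last dimension; no mean subtraction.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms
```

Such a module can be placed before the attention/MLP sublayers (Pre-Norm), after them (Post-Norm), or in a combination of both, which is the set of positioning schemes the quoted passage enumerates.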
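The reported training schedule (Adam, LR 0.0001, linear warm-up over 1% of 50,000 steps, cosine decay down to 0.1×LR) can be expressed with a `LambdaLR` scheduler. The sketch below is one plausible reading of that description; the exact decay formula and the `torch.nn.Linear` placeholder model are assumptions.

```python
import math
import torch

steps, warmup = 50_000, 500            # warm-up is 1% of the 50,000 training steps
lr, final_lr_ratio = 1e-4, 0.1         # cosine decay from LR down to 0.1 * LR

model = torch.nn.Linear(16, 16)        # placeholder for the small Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

def lr_lambda(step: int) -> float:
    """Linear warm-up from 0 to LR, then cosine decay from LR to 0.1 * LR."""
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, steps - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr_ratio + (1.0 - final_lr_ratio) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop, call scheduler.step() once per optimizer.step().
```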