Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure

Authors: Hanseul Cho, Jaeyoung Cha, Pranjal Awasthi, Srinadh Bhojanapalli, Anupam Gupta, Chulhee Yun

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the empirical side, we show that with the proposed position coupling, a small (1-layer) Transformer trained on 1 to 30-digit additions can generalize up to 200-digit additions (6.67× of the trained length). On the theoretical side, we prove that a 1-layer Transformer with coupled positions can solve the addition task involving exponentially many digits, whereas any 1-layer Transformer without positional information cannot entirely solve it.
Researcher Affiliation | Collaboration | Hanseul Cho, Jaeyoung Cha (Graduate School of AI, KAIST; {jhs4015,chajaeyoung}@kaist.ac.kr); Pranjal Awasthi (Google Research; pranjalawasthi@google.com); Srinadh Bhojanapalli (Google Research; bsrinadh@google.com); Anupam Gupta (NYU & Google Research; anupam.g@nyu.edu); Chulhee Yun (Graduate School of AI, KAIST; chulhee.yun@kaist.ac.kr)
Pseudocode | No | No pseudocode or algorithm block is explicitly labeled or presented in a structured format.
Open Source Code | Yes | Our codebase is available at github.com/HanseulJo/position-coupling.
Open Datasets | No | Data Sampling. We opt for the balanced sampling in terms of the number of digits (Nogueira et al., 2021). Given the maximum number of digits D_max, we do balanced sampling for each operand in two steps. First, we sample the number of digits D ∈ [1, D_max] uniformly at random. Next, we sample an operand from [10^(D-1), 10^D - 1] uniformly at random, except for D = 1 where we sample from [0, 9]. This procedure addresses the imbalance problem in the number of digits of operands.
Dataset Splits | Yes | For each run of training, we choose and evaluate the best model in terms of the validation loss for 200-digit additions.
Hardware Specification | Yes | Device: NVIDIA RTX A6000 48GB
Software Dependencies | Yes | We additionally implement a custom RMSNorm module (Zhang and Sennrich, 2019) and various positioning schemes of normalization layers (e.g., Pre-Norm (Xiong et al., 2020), Post-Norm (Vaswani et al., 2017), and their combination), to follow the implementation details of Zhou et al. (2024b). ... established on top of PyTorch (Paszke et al., 2019) and Huggingface
Experiment Setup | Yes | We summarize all hyperparameters in Appendix C. Table 1: Hyperparameter summary for decimal integer addition task: comparison between trained lengths (Figures 3 and 14). ... Training Steps: 50,000; Batch Size: 1,000; Optimizer: Adam (Kingma and Ba, 2015); Learning Rate (LR): 0.0001; LR Warm-up: Linear (from 0 to LR), 1% of total steps; LR Cool-down: Cosine Decay (from LR to 0.1LR); Maximum Position ID (max_pos): 202
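
To make the "Research Type" evidence concrete, the sketch below illustrates the kind of coupled position-ID assignment the paper proposes for addition: digits of equal significance in the two operands and the sum share a single position ID, and a random starting offset is drawn during training so that IDs up to max_pos can be covered even though training lengths are short. The function name, the handling of the '+' and '=' tokens, and the zero-padding convention are assumptions for illustration only; the official repository defines the exact scheme.

```python
# Illustrative sketch (not the authors' exact scheme): assign coupled position IDs
# for "A+B=C" so that digits of equal significance share one ID.
import random

def coupled_position_ids(a: str, b: str, c: str, max_pos: int = 202, train: bool = True):
    width = max(len(a), len(b), len(c))              # align all numbers by significance
    a, b, c = a.zfill(width), b.zfill(width), c.zfill(width)
    tokens = list(a) + ["+"] + list(b) + ["="] + list(c)

    # Random starting offset during training exposes the model to many absolute IDs;
    # at evaluation we simply start from 1. (This offset rule is an assumption.)
    offset = random.randint(1, max_pos - width - 1) if train else 1

    digit_ids = [offset + i for i in range(width)]   # same IDs reused for a, b, and c
    sep_id = offset + width                          # assumption: '+' and '=' share one ID
    pos = digit_ids + [sep_id] + digit_ids + [sep_id] + digit_ids
    return tokens, pos

tokens, pos = coupled_position_ids("653", "49", "702", train=False)
print(list(zip(tokens, pos)))
# [('6', 1), ('5', 2), ('3', 3), ('+', 4), ('0', 1), ('4', 2), ('9', 3),
#  ('=', 4), ('7', 1), ('0', 2), ('2', 3)]
```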
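The "Open Datasets" row quotes the balanced-sampling procedure; the snippet below is a direct sketch of those two steps. The function names are illustrative, not from the authors' codebase.

```python
# Sketch of the balanced sampling quoted above: first draw the digit count D uniformly
# from [1, D_max], then draw an operand uniformly from [10^(D-1), 10^D - 1]
# (or from [0, 9] when D = 1).
import random

def sample_operand(d_max: int) -> int:
    d = random.randint(1, d_max)             # uniform over the number of digits
    low = 0 if d == 1 else 10 ** (d - 1)     # the D = 1 case also allows 0..9
    high = 10 ** d - 1
    return random.randint(low, high)

def sample_addition_example(d_max: int = 30) -> str:
    a, b = sample_operand(d_max), sample_operand(d_max)
    return f"{a}+{b}={a + b}"

print(sample_addition_example(5))   # e.g. "4821+93=4914"
```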
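The "Software Dependencies" row mentions a custom RMSNorm module. Below is a minimal PyTorch sketch of RMSNorm as defined by Zhang and Sennrich (2019); it is not claimed to match the authors' implementation detail-for-detail.

```python
# Minimal RMSNorm (Zhang & Sennrich, 2019) in PyTorch -- a sketch of the kind of
# custom module the quote refers to, not the authors' exact code.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square over the last dimension (no mean centering).
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

x = torch.randn(2, 5, 16)
print(RMSNorm(16)(x).shape)   # torch.Size([2, 5, 16])
```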
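Finally, the schedule listed under "Experiment Setup" (linear warm-up over 1% of steps, then cosine decay from LR to 0.1LR) can be written as a step-to-multiplier function for PyTorch's LambdaLR. This is an illustrative reconstruction from the reported hyperparameters, not the authors' training script.

```python
# Sketch of the quoted schedule: linear warm-up from 0 to LR over the first 1% of
# steps, then cosine decay from LR down to 0.1*LR. Helper name is illustrative.
import math
import torch

def lr_multiplier(step: int, total_steps: int = 50_000, warmup_frac: float = 0.01,
                  final_ratio: float = 0.1) -> float:
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return step / warmup                               # linear warm-up: 0 -> 1
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))    # 1 -> 0
    return final_ratio + (1.0 - final_ratio) * cosine      # 1 -> 0.1

model = torch.nn.Linear(8, 8)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)        # Adam, LR = 0.0001
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_multiplier)
```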