Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure

Authors: Hanseul Cho, Jaeyoung Cha, Pranjal Awasthi, Srinadh Bhojanapalli, Anupam Gupta, Chulhee Yun

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the empirical side, we show that with the proposed position coupling, a small (1-layer) Transformer trained on 1 to 30-digit additions can generalize up to 200-digit additions (6.67× of the trained length). On the theoretical side, we prove that a 1-layer Transformer with coupled positions can solve the addition task involving exponentially many digits, whereas any 1-layer Transformer without positional information cannot entirely solve it.
Researcher Affiliation | Collaboration | Hanseul Cho, Jaeyoung Cha (Graduate School of AI, KAIST; {jhs4015,chajaeyoung}@kaist.ac.kr); Pranjal Awasthi (Google Research; pranjalawasthi@google.com); Srinadh Bhojanapalli (Google Research; bsrinadh@google.com); Anupam Gupta (NYU & Google Research; anupam.g@nyu.edu); Chulhee Yun (Graduate School of AI, KAIST; chulhee.yun@kaist.ac.kr)
Pseudocode | No | No pseudocode or algorithm block is explicitly labeled or presented in a structured format.
Open Source Code | Yes | Our codebase is available at github.com/HanseulJo/position-coupling.
Open Datasets | No | Data Sampling. We opt for the balanced sampling in terms of the number of digits (Nogueira et al., 2021). Given the maximum number of digits D_max, we do balanced sampling for each operand in two steps. First, we sample the number of digits D ∈ [1, D_max] uniformly at random. Next, we sample an operand from [10^(D-1), 10^D - 1] uniformly at random, except for D = 1 where we sample from [0, 9]. This procedure addresses the imbalance problem in the number of digits of operands.
Dataset Splits | Yes | For each run of training, we choose and evaluate the best model in terms of the validation loss for 200-digit additions.
Hardware Specification | Yes | Device: NVIDIA RTX A6000 48GB
Software Dependencies | Yes | We additionally implement a custom RMSNorm module (Zhang and Sennrich, 2019) and various positioning schemes of normalization layers (e.g., Pre-Norm (Xiong et al., 2020), Post-Norm (Vaswani et al., 2017), and their combination), to follow the implementation details of Zhou et al. (2024b). ... established on top of PyTorch (Paszke et al., 2019) and Huggingface
Experiment Setup | Yes | We summarize all hyperparameters in Appendix C. Table 1: Hyperparameter summary for decimal integer addition task: comparison between trained lengths (Figures 3 and 14). ... Training Steps: 50,000; Batch Size: 1,000; Optimizer: Adam (Kingma and Ba, 2015); Learning Rate (LR): 0.0001; LR Warm-up: Linear (from 0 to LR), 1% of total steps; LR Cool-down: Cosine Decay (from LR to 0.1LR); Maximum Position ID (max_pos): 202
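
To make the "Research Type" evidence concrete, the sketch below illustrates the kind of coupled position-ID assignment the paper proposes for addition: digits of equal significance in the two operands and the sum share a single position ID, and a random starting offset is drawn during training so that IDs up to max_pos can be covered even though training lengths are short. The function name, the handling of the '+' and '=' tokens, and the zero-padding convention are assumptions for illustration only; the official repository defines the exact scheme.

```python
# Illustrative sketch (not the authors' exact scheme): assign coupled position IDs
# for "A+B=C" so that digits of equal significance share one ID.
import random

def coupled_position_ids(a: str, b: str, c: str, max_pos: int = 202, train: bool = True):
    width = max(len(a), len(b), len(c))              # align all numbers by significance
    a, b, c = a.zfill(width), b.zfill(width), c.zfill(width)
    tokens = list(a) + ["+"] + list(b) + ["="] + list(c)

    # Random starting offset during training exposes the model to many absolute IDs;
    # at evaluation we simply start from 1. (This offset rule is an assumption.)
    offset = random.randint(1, max_pos - width - 1) if train else 1

    digit_ids = [offset + i for i in range(width)]   # same IDs reused for a, b, and c
    sep_id = offset + width                          # assumption: '+' and '=' share one ID
    pos = digit_ids + [sep_id] + digit_ids + [sep_id] + digit_ids
    return tokens, pos

tokens, pos = coupled_position_ids("653", "49", "702", train=False)
print(list(zip(tokens, pos)))
# [('6', 1), ('5', 2), ('3', 3), ('+', 4), ('0', 1), ('4', 2), ('9', 3),
#  ('=', 4), ('7', 1), ('0', 2), ('2', 3)]
```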
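The "Open Datasets" row quotes the balanced-sampling procedure; the snippet below is a direct sketch of those two steps. The function names are illustrative, not from the authors' codebase.

```python
# Sketch of the balanced sampling quoted above: first draw the digit count D uniformly
# from [1, D_max], then draw an operand uniformly from [10^(D-1), 10^D - 1]
# (or from [0, 9] when D = 1).
import random

def sample_operand(d_max: int) -> int:
    d = random.randint(1, d_max)             # uniform over the number of digits
    low = 0 if d == 1 else 10 ** (d - 1)     # the D = 1 case also allows 0..9
    high = 10 ** d - 1
    return random.randint(low, high)

def sample_addition_example(d_max: int = 30) -> str:
    a, b = sample_operand(d_max), sample_operand(d_max)
    return f"{a}+{b}={a + b}"

print(sample_addition_example(5))   # e.g. "4821+93=4914"
```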
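The "Software Dependencies" row mentions a custom RMSNorm module. Below is a minimal PyTorch sketch of RMSNorm as defined by Zhang and Sennrich (2019); it is not claimed to match the authors' implementation detail-for-detail.

```python
# Minimal RMSNorm (Zhang & Sennrich, 2019) in PyTorch -- a sketch of the kind of
# custom module the quote refers to, not the authors' exact code.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square over the last dimension (no mean centering).
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

x = torch.randn(2, 5, 16)
print(RMSNorm(16)(x).shape)   # torch.Size([2, 5, 16])
```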
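Finally, the schedule listed under "Experiment Setup" (linear warm-up over 1% of steps, then cosine decay from LR to 0.1LR) can be written as a step-to-multiplier function for PyTorch's LambdaLR. This is an illustrative reconstruction from the reported hyperparameters, not the authors' training script.

```python
# Sketch of the quoted schedule: linear warm-up from 0 to LR over the first 1% of
# steps, then cosine decay from LR down to 0.1*LR. Helper name is illustrative.
import math
import torch

def lr_multiplier(step: int, total_steps: int = 50_000, warmup_frac: float = 0.01,
                  final_ratio: float = 0.1) -> float:
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return step / warmup                               # linear warm-up: 0 -> 1
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))    # 1 -> 0
    return final_ratio + (1.0 - final_ratio) * cosine      # 1 -> 0.1

model = torch.nn.Linear(8, 8)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)        # Adam, LR = 0.0001
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_multiplier)
```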