Towards a Theoretical Understanding of the 'Reversal Curse' via Training Dynamics
Authors: Hanlin Zhu, Baihe Huang, Shaolun Zhang, Michael Jordan, Jiantao Jiao, Yuandong Tian, Stuart J. Russell
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we theoretically analyze the reversal curse via the training dynamics of (stochastic) gradient descent for two auto-regressive models: (1) a bilinear model that can be viewed as a simplification of a one-layer transformer; (2) one-layer transformers under certain assumptions. Our analysis reveals that for both models, the reversal curse is a consequence of the asymmetry of the (effective) model weights, i.e., an increase of the weights from a token A to a token B during training does not necessarily cause an increase of the weights from B to A; this asymmetry arises from the training dynamics under certain choices of loss function and optimization space of the model parameters (see the weight-asymmetry sketch after this table). Moreover, our analysis can be naturally applied to other logical reasoning tasks such as chain-of-thought (COT), which provides a new perspective different from previous work that focuses on expressivity. Finally, we conduct experiments to validate our theory on multi-layer transformers under different settings. |
| Researcher Affiliation | Collaboration | Hanlin Zhu UC Berkeley hanlinzhu@berkeley.edu Baihe Huang UC Berkeley baihe_huang@berkeley.edu Shaolun Zhang UC Berkeley shaolun_zhang@berkeley.edu Michael Jordan UC Berkeley jordan@cs.berkeley.edu Jiantao Jiao UC Berkeley jiantao@berkeley.edu Yuandong Tian Meta AI yuandong@meta.com Stuart Russell UC Berkeley russell@cs.berkeley.edu |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Our code is available at https://github.com/marlo-z/reversal_curse_analysis/. |
| Open Datasets | No | We choose the vocabulary V = {0, 1, . . . , N} for a specified N > 0. We randomly sample two disjoint sets of entities A, B ⊂ V with \|A\| = \|B\| = \|V\|/4, and reserve two additional tokens for the relationships → and ←, respectively. Next, we specify a bijection from A to B uniformly at random. For each Ai ∈ A and its corresponding Bi ∈ B, we can obtain a pair of sequences (Ai → Bi, Bi ← Ai). We split the set of all pairs into training pairs and validation pairs. (A construction sketch follows the table.) |
| Dataset Splits | Yes | We split the set of all pairs into training pairs and validation pairs. For each training pair, both sequences will be included in the training set, while for the validation pair, we randomly select one sequence for the training set and the other for the validation set. Therefore, the model will learn both directions for the training pairs and only one direction for each validation pair while being tested in the unseen direction. ... The training set size is 340, and the validation set size is 60 (resulting from 140 training pairs and 60 validation pairs). |
| Hardware Specification | Yes | We run each trial on an NVIDIA A100 GPU; each trial typically takes 0.5-1.5 hours. |
| Software Dependencies | No | We used the GPT2 model architecture [63] and trained the model with the AdamW optimizer for 3000 epochs with batch size 64. See Table 2 for a full list of hyperparameters. The paper does not explicitly list software versions for libraries such as PyTorch or the Hugging Face Transformers library. |
| Experiment Setup | Yes | For both the reversal curse and COT experiments, we used the GPT2 model architecture [63] and trained the model with the AdamW optimizer for 3000 epochs with batch size 64. See Table 2 for a full list of hyperparameters. We also conducted experiments with various model configurations and vocabulary sizes to show that the results in Section 5 and Appendix D are consistent under different settings. See Table 3 for a complete list of different configurations, where the default choices are shown in bold. (A hedged configuration sketch follows the table.) |
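
The weight-asymmetry mechanism quoted in the Research Type row can be illustrated with a toy example. The sketch below is not the paper's bilinear model but a further simplification under stated assumptions: next-token logits are read directly off a score matrix `W`, so that p(y | x) = softmax(W[x])[y], and the model is trained only on the forward direction A → B. Token ids, learning rate, and step count are illustrative.

```python
# Toy sketch (not the authors' code): next-token model p(y | x) = softmax(W[x])[y],
# trained only on the forward pair (A -> B). Gradients touch row W[A] and never
# row W[B], so the reverse probability p(A | B) stays at chance.
import torch
import torch.nn.functional as F

vocab_size = 16
A, B = 3, 7                                                   # two arbitrary entity tokens
W = torch.zeros(vocab_size, vocab_size, requires_grad=True)   # W[x, y] = logit of y following x
opt = torch.optim.SGD([W], lr=1.0)

def next_token_prob(x, y):
    return F.softmax(W[x], dim=-1)[y].item()

print("before:  p(B|A) =", next_token_prob(A, B), " p(A|B) =", next_token_prob(B, A))

for _ in range(200):                                          # train only on A -> B
    opt.zero_grad()
    loss = F.cross_entropy(W[A].unsqueeze(0), torch.tensor([B]))
    loss.backward()
    opt.step()

print("after:   p(B|A) =", next_token_prob(A, B), " p(A|B) =", next_token_prob(B, A))
# p(B|A) approaches 1 while p(A|B) remains ~1/vocab_size.
```

Because the cross-entropy gradient only updates the row W[A], the reverse logits W[B, ·] never move, which is the asymmetry the paper attributes the reversal curse to; its actual analysis covers a bilinear model and one-layer transformers rather than this lookup-table toy.

The dataset described in the Open Datasets and Dataset Splits rows can be reconstructed as follows. This is a sketch rather than the released code: the relationship-token ids and random seed are arbitrary choices, and the vocabulary size is set so that the pair counts match the reported 140 training pairs and 60 validation pairs.

```python
# Sketch of the forward/backward dataset construction and split described above.
# Sizes are chosen so the resulting counts match the reported 340 / 60 split.
import random

random.seed(0)
N = 799                                    # vocabulary {0, ..., N}, so |V| = 800 and |A| = |B| = 200
vocab = list(range(N + 1))
fwd_rel, bwd_rel = N + 1, N + 2            # two reserved relationship tokens ("->" and "<-"), ids illustrative

entities = random.sample(vocab, (N + 1) // 2)
A, B = entities[: (N + 1) // 4], entities[(N + 1) // 4 :]     # disjoint entity sets
random.shuffle(B)                                             # uniformly random bijection A_i <-> B_i

pairs = [((a, fwd_rel, b), (b, bwd_rel, a)) for a, b in zip(A, B)]
random.shuffle(pairs)
train_pairs, val_pairs = pairs[:140], pairs[140:200]

train_set, val_set = [], []
for fwd, bwd in train_pairs:               # training pairs: both directions seen in training
    train_set += [fwd, bwd]
for fwd, bwd in val_pairs:                 # validation pairs: one random direction for training,
    seen, held_out = random.sample([fwd, bwd], 2)             # the other held out for evaluation
    train_set.append(seen)
    val_set.append(held_out)

print(len(train_set), len(val_set))        # 340 and 60, matching the reported split sizes
```

Finally, a minimal sketch of the reported training setup (GPT2 architecture, AdamW optimizer, 3000 epochs, batch size 64) using standard Hugging Face Transformers and PyTorch APIs; the paper does not specify library versions. All model-size and learning-rate values below are placeholders, since the actual hyperparameters live in the paper's Table 2, which is not quoted in this report.

```python
# Configuration sketch (not the authors' script): GPT-2 architecture + AdamW,
# 3000 epochs at batch size 64. Model sizes and learning rate are placeholders.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=802,        # entity vocabulary plus two relationship tokens (illustrative)
    n_positions=8,         # sequences here are only a few tokens long
    n_layer=2,             # placeholder; see the paper's Table 2 for the actual choices
    n_head=2,
    n_embd=128,
)
model = GPT2LMHeadModel(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)    # placeholder learning rate

def train(loader, epochs=3000):
    """Skeleton training loop: causal LM loss over batches of 64 sequences."""
    model.train()
    for _ in range(epochs):
        for batch in loader:                                  # batch: LongTensor of shape (64, seq_len)
            out = model(input_ids=batch, labels=batch)
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```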
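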
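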