Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training

Authors: Yanlai Yang, Matt Jones, Michael C. Mozer, Mengye Ren

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We explore the training dynamics of neural networks in a structured non-IID setting where documents are presented cyclically in a fixed, repeated sequence. Through comprehensive experiments and visualizations, we demonstrate a new mechanism by which over-parametrized neural networks can recover from catastrophic interference and uncover new insights into training over-parameterized networks in cyclically structured environments.
Researcher Affiliation | Collaboration | Yanlai Yang (New York University), Matt Jones (University of Colorado, Boulder), Michael C. Mozer (Google DeepMind; University of Colorado, Boulder), and Mengye Ren (New York University)
Pseudocode | No | Just as when training a deep net, we assume here that representation learning occurs slowly, and that one training step for task i involves a single gradient update of P with step size $\alpha$: $P \leftarrow P - \alpha\,(P x_i - f_i(w))\,x_i^\top$ (1). In contrast, at each training step, w, analogous to the fast-adapting weights in a neural network, can be rapidly tuned to solve for task i, yielding the loss minimizer conditional on P: $w \leftarrow f_i^{-1}(P x_i)$ (2). (A code sketch of these two updates appears after the table.)
Open Source Code | Yes | We provide the code and instructions for reproducing main experimental results in the supplementary material.
Open Datasets | Yes | For the LLM experiments, we use the CNN/Daily Mail news dataset [17]. For the vision experiments, we use images sampled from CIFAR-10 [18] and ImageNet [19].
Dataset Splits | No | We use the same documents for both training and evaluation. Our goal here is not to determine whether a trained model generalizes to new documents, but rather to study the memory for a particular document as a function of position within the training history.
Hardware Specification | Yes | Each experiment presented in the paper is run with one NVIDIA A100 GPU, 2 CPUs, and 32GB of RAM.
Software Dependencies | No | We use the Huggingface Transformers Library [63] for fine-tuning the LLMs.
Experiment Setup | Yes | Unless otherwise stated, the default hyperparameters in the subsequent experiments are T = 25, C = 256, M = 10, E = 5. We use the average cross-entropy loss (average negative log-likelihood for each token) as our training and evaluation metric. The learning rate is 0.001 for vanilla gradient descent and 0.00001 for Adam. (A sketch of a matching cyclic fine-tuning loop appears after the table.)
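
The two updates quoted under Pseudocode, Eqs. (1) and (2), can be illustrated with a short sketch. The version below assumes the implied per-task loss is the squared error between P x_i and f_i(w), and uses an invertible affine task map f_i(w) = A_i w + b_i purely for illustration; the paper's exact choice of f_i, the dimensions, and the initialization are not specified in the quoted text.

```python
# Minimal NumPy sketch of the toy-model updates in Eqs. (1)-(2).
# Assumption: loss L_i = 1/2 * ||P x_i - f_i(w)||^2 with an invertible affine
# task map f_i(w) = A_i w + b_i (illustrative choice, not from the paper).
import numpy as np

rng = np.random.default_rng(0)
d = 8          # dimensionality of x_i, w, and f_i(w) (illustrative)
T = 25         # number of tasks/documents in the cycle
alpha = 0.001  # step size for the slow representation P

# Per-task inputs and invertible task maps f_i(w) = A_i @ w + b_i.
xs = [rng.normal(size=d) for _ in range(T)]
As = [rng.normal(size=(d, d)) + 3.0 * np.eye(d) for _ in range(T)]
bs = [rng.normal(size=d) for _ in range(T)]

P = 0.1 * rng.normal(size=(d, d))  # slow-learning representation
w = np.zeros(d)                    # fast-adapting weights

def train_step(P, w, i):
    """One step on task i: slow update of P (Eq. 1), then fast solve for w (Eq. 2)."""
    x, A, b = xs[i], As[i], bs[i]
    # Eq. (1): P <- P - alpha * (P x_i - f_i(w)) x_i^T
    residual = P @ x - (A @ w + b)
    P = P - alpha * np.outer(residual, x)
    # Eq. (2): w <- f_i^{-1}(P x_i), the loss minimizer conditional on P
    w = np.linalg.solve(A, P @ x - b)
    return P, w

# Present the tasks cyclically in a fixed, repeated order.
for epoch in range(5):
    for i in range(T):
        P, w = train_step(P, w, i)
```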
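For the LLM experiments, the sketch below shows a cyclic fine-tuning loop consistent with the quoted setup (Huggingface Transformers, CNN/Daily Mail, average per-token cross-entropy, Adam at 0.00001). It assumes, as an interpretation not confirmed by the quoted text, that T is the number of documents, C the token context length, M the number of gradient steps per document visit, and E the number of passes through the fixed cycle; the model name ("gpt2") is an illustrative placeholder.

```python
# Sketch of cyclic fine-tuning on a fixed, repeated document sequence.
# Assumptions: T documents, C-token context, M gradient steps per visit,
# E passes over the cycle; "gpt2" stands in for the actual model.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

T, C, M, E = 25, 256, 10, 5
LR = 1e-5  # Adam learning rate from the quoted setup

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

# Take T articles from CNN/Daily Mail and truncate each to C tokens.
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train").select(range(T))
docs = [
    tokenizer(a["article"], truncation=True, max_length=C, return_tensors="pt")
    for a in dataset
]

model.train()
for epoch in range(E):          # repeated passes over the same cycle
    for i in range(T):          # documents presented in a fixed order
        batch = {k: v.cuda() for k, v in docs[i].items()}
        for _ in range(M):      # M gradient steps on document i
            # Causal-LM loss = average negative log-likelihood per token.
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```

Evaluation in the same setting would simply recompute this per-token loss on each of the T training documents at different points in the cycle, since the paper evaluates memory for the training documents themselves rather than held-out generalization.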