Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training

Authors: Yanlai Yang, Matt Jones, Michael C. Mozer, Mengye Ren

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We explore the training dynamics of neural networks in a structured non-IID setting where documents are presented cyclically in a fixed, repeated sequence. Through comprehensive experiments and visualizations, we demonstrate a new mechanism by which over-parametrized neural networks can recover from catastrophic interference and uncover new insights into training over-parameterized networks in cyclically structured environments.
Researcher Affiliation | Collaboration | Yanlai Yang (New York University), Matt Jones (University of Colorado, Boulder), Michael C. Mozer (Google DeepMind; University of Colorado, Boulder), and Mengye Ren (New York University)
Pseudocode | No | Just as when training a deep net, we assume here that representation learning occurs slowly, and that one training step for task i involves a single gradient update of P with step size $\alpha$: $P \leftarrow P - \alpha\,(P x_i - f_i(w))\,x_i^\top$ (1). In contrast, at each training step, w, analogous to the fast-adapting weights in a neural network, can be rapidly tuned to solve for task i, yielding the loss minimizer conditional on P: $w \leftarrow f_i^{-1}(P x_i)$ (2). (A code sketch of these two updates appears after the table.)
Open Source Code | Yes | We provide the code and instructions for reproducing main experimental results in the supplementary material.
Open Datasets | Yes | For the LLM experiments, we use the CNN/Daily Mail news dataset [17]. For the vision experiments, we use images sampled from CIFAR-10 [18] and ImageNet [19].
Dataset Splits | No | We use the same documents for both training and evaluation. Our goal here is not to determine whether a trained model generalizes to new documents, but rather to study the memory for a particular document as a function of position within the training history.
Hardware Specification | Yes | Each experiment presented in the paper is run with one NVIDIA A100 GPU, 2 CPUs, and 32GB of RAM.
Software Dependencies | No | We use the Huggingface Transformers Library [63] for fine-tuning the LLMs.
Experiment Setup | Yes | Unless otherwise stated, the default hyperparameters in the subsequent experiments are T = 25, C = 256, M = 10, E = 5. We use the average cross-entropy loss (average negative log-likelihood for each token) as our training and evaluation metric. The learning rate is 0.001 for vanilla gradient descent and 0.00001 for Adam. (A sketch of a matching cyclic fine-tuning loop appears after the table.)
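
The two updates quoted under Pseudocode, Eqs. (1) and (2), can be illustrated with a short sketch. The version below assumes the implied per-task loss is the squared error between P x_i and f_i(w), and uses an invertible affine task map f_i(w) = A_i w + b_i purely for illustration; the paper's exact choice of f_i, the dimensions, and the initialization are not specified in the quoted text.

```python
# Minimal NumPy sketch of the toy-model updates in Eqs. (1)-(2).
# Assumption: loss L_i = 1/2 * ||P x_i - f_i(w)||^2 with an invertible affine
# task map f_i(w) = A_i w + b_i (illustrative choice, not from the paper).
import numpy as np

rng = np.random.default_rng(0)
d = 8          # dimensionality of x_i, w, and f_i(w) (illustrative)
T = 25         # number of tasks/documents in the cycle
alpha = 0.001  # step size for the slow representation P

# Per-task inputs and invertible task maps f_i(w) = A_i @ w + b_i.
xs = [rng.normal(size=d) for _ in range(T)]
As = [rng.normal(size=(d, d)) + 3.0 * np.eye(d) for _ in range(T)]
bs = [rng.normal(size=d) for _ in range(T)]

P = 0.1 * rng.normal(size=(d, d))  # slow-learning representation
w = np.zeros(d)                    # fast-adapting weights

def train_step(P, w, i):
    """One step on task i: slow update of P (Eq. 1), then fast solve for w (Eq. 2)."""
    x, A, b = xs[i], As[i], bs[i]
    # Eq. (1): P <- P - alpha * (P x_i - f_i(w)) x_i^T
    residual = P @ x - (A @ w + b)
    P = P - alpha * np.outer(residual, x)
    # Eq. (2): w <- f_i^{-1}(P x_i), the loss minimizer conditional on P
    w = np.linalg.solve(A, P @ x - b)
    return P, w

# Present the tasks cyclically in a fixed, repeated order.
for epoch in range(5):
    for i in range(T):
        P, w = train_step(P, w, i)
```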
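For the LLM experiments, the sketch below shows a cyclic fine-tuning loop consistent with the quoted setup (Huggingface Transformers, CNN/Daily Mail, average per-token cross-entropy, Adam at 0.00001). It assumes, as an interpretation not confirmed by the quoted text, that T is the number of documents, C the token context length, M the number of gradient steps per document visit, and E the number of passes through the fixed cycle; the model name ("gpt2") is an illustrative placeholder.

```python
# Sketch of cyclic fine-tuning on a fixed, repeated document sequence.
# Assumptions: T documents, C-token context, M gradient steps per visit,
# E passes over the cycle; "gpt2" stands in for the actual model.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

T, C, M, E = 25, 256, 10, 5
LR = 1e-5  # Adam learning rate from the quoted setup

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

# Take T articles from CNN/Daily Mail and truncate each to C tokens.
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train").select(range(T))
docs = [
    tokenizer(a["article"], truncation=True, max_length=C, return_tensors="pt")
    for a in dataset
]

model.train()
for epoch in range(E):          # repeated passes over the same cycle
    for i in range(T):          # documents presented in a fixed order
        batch = {k: v.cuda() for k, v in docs[i].items()}
        for _ in range(M):      # M gradient steps on document i
            # Causal-LM loss = average negative log-likelihood per token.
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```

Evaluation in the same setting would simply recompute this per-token loss on each of the T training documents at different points in the cycle, since the paper evaluates memory for the training documents themselves rather than held-out generalization.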