Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training
Authors: Yanlai Yang, Matt Jones, Michael C. Mozer, Mengye Ren
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explore the training dynamics of neural networks in a structured non-IID setting where documents are presented cyclically in a fixed, repeated sequence. Through comprehensive experiments and visualizations, we demonstrate a new mechanism by which over-parametrized neural networks can recover from catastrophic interference and uncover new insights into training over-parameterized networks in cyclically structured environments. |
| Researcher Affiliation | Collaboration | Yanlai Yang (1), Matt Jones (2), Michael C. Mozer (3,2), and Mengye Ren (1); (1) New York University, (2) University of Colorado Boulder, (3) Google DeepMind |
| Pseudocode | No | Just as when training a deep net, we assume here that representation learning occurs slowly, and that one training step for task i involves a single gradient update of $P$ with step size $\alpha$: $P \leftarrow P - \alpha (P x_i - f_i(w)) x_i^\top$ (1). In contrast, at each training step, $w$, analogous to the fast-adapting weights in a neural network, can be rapidly tuned to solve for task i, yielding the loss minimizer conditional on $P$: $w \leftarrow f_i^{-1}(P x_i)$ (2). (A runnable sketch of these two updates follows the table.) |
| Open Source Code | Yes | We provide the code and instructions for reproducing main experimental results in the supplementary material. |
| Open Datasets | Yes | For the LLM experiments, we use the CNN/Daily Mail news dataset [17]. For the vision experiments, we use images sampled from CIFAR-10 [18] and ImageNet [19]. |
| Dataset Splits | No | We use the same documents for both training and evaluation. Our goal here is not to determine whether a trained model generalizes to new documents, but rather to study the memory for a particular document as a function of position within the training history. |
| Hardware Specification | Yes | Each experiment presented in the paper is run with one NVIDIA A100 GPU, 2 CPUs, and 32GB of RAM. |
| Software Dependencies | No | We use the Huggingface Transformers Library [63] for fine-tuning the LLMs. |
| Experiment Setup | Yes | Unless otherwise stated, the default hyperparameters in the subsequent experiments are T = 25, C = 256, M = 10, E = 5. We use the average cross-entropy loss (average negative log-likelihood for each token) as our training and evaluation metric. The learning rate is 0.001 for vanilla gradient descent and 0.00001 for Adam. (A hedged sketch of this setup also follows the table.) |
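
The Pseudocode row quotes the paper's linear slow/fast-weight updates, equations (1) and (2). The sketch below is a minimal illustration of those two updates under stated assumptions: the task maps $f_i$ are taken to be invertible linear maps $f_i(w) = A_i w$, and the dimensions, step size, number of cycles, and the ordering of the two updates within a step are illustrative choices rather than values from the paper or its code.

```python
import numpy as np

# Illustrative sketch of eqs. (1)-(2) from the Pseudocode row: the slow weights P
# take one gradient step per task visit, while the fast weights w jump to the
# conditional loss minimizer. The linear task maps A_i, dimensions, step size,
# cycle count, and update ordering are assumptions made for illustration only.

rng = np.random.default_rng(0)
d, T = 8, 4                        # feature dimension and number of tasks (assumed)
alpha = 0.01                       # step size for the slow update (assumed)

A = [rng.normal(size=(d, d)) for _ in range(T)]   # f_i(w) = A_i @ w (assumed linear)
x = [rng.normal(size=d) for _ in range(T)]        # per-task inputs
P = rng.normal(size=(d, d))                       # slow representation weights
w = rng.normal(size=d)                            # fast-adapting weights

for cycle in range(10):            # tasks presented cyclically in a fixed order
    for i in range(T):
        # Eq. (1): gradient step on the slow weights using the current
        # (not yet re-adapted) fast weights: P <- P - alpha*(P x_i - f_i(w)) x_i^T.
        P -= alpha * np.outer(P @ x[i] - A[i] @ w, x[i])
        # Eq. (2): fast weights jump to the loss minimizer for task i given P:
        # w <- f_i^{-1}(P x_i), here an exact linear solve.
        w = np.linalg.solve(A[i], P @ x[i])
```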
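
The Experiment Setup row lists the default hyperparameters and the training objective, and the Software Dependencies row notes that fine-tuning uses the Hugging Face Transformers library. The sketch below shows how such a cyclic fine-tuning run could be wired up; the interpretation of T, C, M, E (number of documents, context length in tokens, number of cycles, and gradient steps per document visit), the choice of `gpt2` as a stand-in model, and the loop structure are all assumptions, not the authors' released code.

```python
import torch
from torch.nn.functional import cross_entropy
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed hyperparameter meanings: T documents, C-token context,
# M cycles through the fixed document sequence, E gradient steps per visit.
T, C, M, E = 25, 256, 10, 5

model = AutoModelForCausalLM.from_pretrained("gpt2")       # stand-in model (assumption)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # quoted setup: 1e-3 for plain gradient descent

documents = ["(document text goes here)"] * T  # placeholder for T news articles (e.g. CNN/Daily Mail)

def doc_loss(doc: str) -> torch.Tensor:
    """Average per-token cross-entropy on the first C tokens of a document."""
    ids = tokenizer(doc, return_tensors="pt", truncation=True, max_length=C).input_ids
    logits = model(ids).logits
    # Shift so each position predicts the next token.
    return cross_entropy(logits[:, :-1].flatten(0, 1), ids[:, 1:].flatten())

for cycle in range(M):         # repeat the same document sequence M times
    for doc in documents:      # documents presented cyclically in a fixed order
        for _ in range(E):     # E gradient steps on each document per visit
            loss = doc_loss(doc)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```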