Gradient-based Editing of Memory Examples for Online Task-free Continual Learning

Authors: Xisen Jin, Arka Sadhu, Junyi Du, Xiang Ren

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments validate the effectiveness of GMED, and our best method significantly outperforms baselines and previous state-of-the-art on five out of six datasets.
Researcher Affiliation | Academia | Xisen Jin, Arka Sadhu, Junyi Du, Xiang Ren, University of Southern California, {xisenjin, asadhu, junyidu, xiangren}@usc.edu
Pseudocode | Yes | Algorithm 1: Gradient Memory EDiting with ER (ER+GMED)
Open Source Code | Yes | Code can be found at https://github.com/INK-USC/GMED.
Open Datasets | Yes | We use six public CL datasets in our experiments. Split / Permuted / Rotated MNIST are constructed from the MNIST [21] dataset which contains images of handwritten digits. We also employ Split CIFAR-10 and Split CIFAR-100, which comprise 5 and 20 disjoint subsets respectively based on their class labels. Similarly, Split mini-ImageNet [2] splits the mini-ImageNet [10, 42] dataset into 20 disjoint subsets based on their labels.
Dataset Splits | No | For all MNIST experiments, each task consists of 1,000 training examples following [2]. We set the size of replay memory as 10K for Split CIFAR-100 and Split mini-ImageNet, and 500 for all remaining datasets.
Hardware Specification | No | Computational Efficiency. We analyze the additional forward and backward computation required by ER+GMED and MIR. Compared to ER, ER+GMED adds 3 forward and 1 backward passes to estimate loss increase, and 1 backward pass to update the example. In comparison, MIR adds 3 forward and 1 backward passes, with 2 of the forward passes over a larger set of retrieval candidates. In our experiments, we found GMED has a training time cost similar to MIR. In Appendix B, we report the wall-clock time, and observe the run-time of ER+GMED is 1.5 times that of ER.
Software Dependencies | No | For model architectures, we mostly follow the setup of [2]: for the three MNIST datasets, we use an MLP classifier with 2 hidden layers of 400 hidden units each. For the Split CIFAR-10, Split CIFAR-100 and Split mini-ImageNet datasets, we use a ResNet-18 classifier with three times fewer feature maps across all layers.
Experiment Setup | Yes | We set the size of replay memory as 10K for Split CIFAR-100 and Split mini-ImageNet, and 500 for all remaining datasets. Following [8], we tune the hyper-parameters α (editing stride) and β (regularization strength) with only the first three tasks. While γ (decay rate of the editing stride) is a hyper-parameter that may flexibly control the deviation of edited examples from their original states, we find γ=1.0 (i.e., no decay) leads to better performance in our experiments.
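
The Pseudocode entry above cites Algorithm 1 (ER+GMED). As a rough illustration of the gradient-based editing idea, the following PyTorch-style sketch edits replayed memory examples in the direction that increases their loss after a simulated ("virtual") one-step update on the incoming stream batch. This is a minimal sketch under stated assumptions, not the authors' code: the function name gmed_edit, the use of cross-entropy, and the exact regularization form should all be checked against Algorithm 1 and the released repository.

```python
import copy
import torch
import torch.nn.functional as F

def gmed_edit(model, lr, x_stream, y_stream, x_mem, y_mem, alpha=0.1, beta=0.01):
    """Sketch of one GMED-style editing step (illustrative, not the authors' code).

    Edits replayed memory examples toward larger estimated loss increase after a
    virtual update on the incoming batch, with a regularizer on the current loss.
    """
    # Loss of the memory examples under the current parameters.
    x_edit = x_mem.clone().detach().requires_grad_(True)
    loss_before = F.cross_entropy(model(x_edit), y_mem)

    # Virtual one-step SGD update of a copied model on the stream batch.
    virtual_model = copy.deepcopy(model)
    stream_loss = F.cross_entropy(virtual_model(x_stream), y_stream)
    grads = torch.autograd.grad(stream_loss, tuple(virtual_model.parameters()))
    with torch.no_grad():
        for p, g in zip(virtual_model.parameters(), grads):
            p -= lr * g

    # Loss of the memory examples under the virtually updated parameters.
    loss_after = F.cross_entropy(virtual_model(x_edit), y_mem)

    # Estimated loss increase, regularized by the current loss so edited
    # examples do not drift too far (exact form in Algorithm 1 may differ).
    objective = (loss_after - loss_before) - beta * loss_before
    grad_x = torch.autograd.grad(objective, x_edit)[0]

    # Edit the examples with stride alpha and return them for the ER update.
    return (x_edit + alpha * grad_x).detach()
```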
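The Software Dependencies entry quotes the model architectures. A minimal PyTorch sketch of the MNIST classifier described there (an MLP with 2 hidden layers of 400 units each) could look as follows; the ReLU activation and layer ordering are assumptions, and the reduced-width ResNet-18 used for the CIFAR and mini-ImageNet splits is not reproduced here.

```python
import torch.nn as nn

# Sketch of the MNIST MLP classifier quoted above: two hidden layers of
# 400 units each, mapping 28x28 images to 10 digit classes.
mnist_mlp = nn.Sequential(
    nn.Flatten(),               # 28x28 images -> 784-dimensional vectors
    nn.Linear(28 * 28, 400),
    nn.ReLU(),                  # activation choice is an assumption
    nn.Linear(400, 400),
    nn.ReLU(),
    nn.Linear(400, 10),
)
```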
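The Dataset Splits and Experiment Setup entries list the replay-memory sizes and hyper-parameter protocol; collecting them into a single configuration gives roughly the sketch below. Key names are hypothetical; the values are taken from the quotes above.

```python
# Illustrative configuration assembled from the quoted setup (key names are
# hypothetical, values as stated in the entries above).
gmed_config = {
    "memory_size": {
        "split_cifar100": 10_000,
        "split_mini_imagenet": 10_000,
        "default": 500,              # all remaining datasets
    },
    "mnist_examples_per_task": 1_000,
    "gamma": 1.0,                    # decay rate of the editing stride (no decay)
    # alpha (editing stride) and beta (regularization strength) are tuned on
    # the first three tasks only, following the quoted protocol.
}
```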