Gradient-based Editing of Memory Examples for Online Task-free Continual Learning

Authors: Xisen Jin, Arka Sadhu, Junyi Du, Xiang Ren

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments validate the effectiveness of GMED, and our best method significantly outperforms baselines and previous state-of-the-art on five out of six datasets.
Researcher Affiliation | Academia | Xisen Jin, Arka Sadhu, Junyi Du, Xiang Ren, University of Southern California, {xisenjin, asadhu, junyidu, xiangren}@usc.edu
Pseudocode | Yes | Algorithm 1: Gradient Memory EDiting with ER (ER+GMED)
Open Source Code | Yes | Code can be found at https://github.com/INK-USC/GMED.
Open Datasets | Yes | We use six public CL datasets in our experiments. Split / Permuted / Rotated MNIST are constructed from the MNIST [21] dataset which contains images of handwritten digits. We also employ Split CIFAR-10 and Split CIFAR-100, which comprise 5 and 20 disjoint subsets respectively based on their class labels. Similarly, Split mini-ImageNet [2] splits the mini-ImageNet [10, 42] dataset into 20 disjoint subsets based on their labels.
Dataset Splits | No | For all MNIST experiments, each task consists of 1,000 training examples following [2]. We set the size of replay memory as 10K for Split CIFAR-100 and Split mini-ImageNet, and 500 for all remaining datasets.
Hardware Specification | No | Computational Efficiency. We analyze the additional forward and backward computation required by ER+GMED and MIR. Compared to ER, ER+GMED adds 3 forward and 1 backward passes to estimate loss increase, and 1 backward pass to update the example. In comparison, MIR adds 3 forward and 1 backward passes, with 2 of the forward passes over a larger set of retrieval candidates. In our experiments, we found GMED has a training time cost similar to MIR. In Appendix B, we report the wall-clock time, and observe the run-time of ER+GMED is 1.5 times that of ER.
Software Dependencies | No | For model architectures, we mostly follow the setup of [2]: for the three MNIST datasets, we use an MLP classifier with 2 hidden layers of 400 hidden units each. For the Split CIFAR-10, Split CIFAR-100 and Split mini-ImageNet datasets, we use a ResNet-18 classifier with three times fewer feature maps across all layers.
Experiment Setup | Yes | We set the size of replay memory as 10K for Split CIFAR-100 and Split mini-ImageNet, and 500 for all remaining datasets. Following [8], we tune the hyper-parameters α (editing stride) and β (regularization strength) with only the first three tasks. While γ (decay rate of the editing stride) is a hyper-parameter that may flexibly control the deviation of edited examples from their original states, we find γ=1.0 (i.e., no decay) leads to better performance in our experiments.
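
The Pseudocode entry above cites Algorithm 1 (ER+GMED). As a rough illustration of the gradient-based editing idea, the following PyTorch-style sketch edits replayed memory examples in the direction that increases their loss after a simulated ("virtual") one-step update on the incoming stream batch. This is a minimal sketch under stated assumptions, not the authors' code: the function name gmed_edit, the use of cross-entropy, and the exact regularization form should all be checked against Algorithm 1 and the released repository.

```python
import copy
import torch
import torch.nn.functional as F

def gmed_edit(model, lr, x_stream, y_stream, x_mem, y_mem, alpha=0.1, beta=0.01):
    """Sketch of one GMED-style editing step (illustrative, not the authors' code).

    Edits replayed memory examples toward larger estimated loss increase after a
    virtual update on the incoming batch, with a regularizer on the current loss.
    """
    # Loss of the memory examples under the current parameters.
    x_edit = x_mem.clone().detach().requires_grad_(True)
    loss_before = F.cross_entropy(model(x_edit), y_mem)

    # Virtual one-step SGD update of a copied model on the stream batch.
    virtual_model = copy.deepcopy(model)
    stream_loss = F.cross_entropy(virtual_model(x_stream), y_stream)
    grads = torch.autograd.grad(stream_loss, tuple(virtual_model.parameters()))
    with torch.no_grad():
        for p, g in zip(virtual_model.parameters(), grads):
            p -= lr * g

    # Loss of the memory examples under the virtually updated parameters.
    loss_after = F.cross_entropy(virtual_model(x_edit), y_mem)

    # Estimated loss increase, regularized by the current loss so edited
    # examples do not drift too far (exact form in Algorithm 1 may differ).
    objective = (loss_after - loss_before) - beta * loss_before
    grad_x = torch.autograd.grad(objective, x_edit)[0]

    # Edit the examples with stride alpha and return them for the ER update.
    return (x_edit + alpha * grad_x).detach()
```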
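The Software Dependencies entry quotes the model architectures. A minimal PyTorch sketch of the MNIST classifier described there (an MLP with 2 hidden layers of 400 units each) could look as follows; the ReLU activation and layer ordering are assumptions, and the reduced-width ResNet-18 used for the CIFAR and mini-ImageNet splits is not reproduced here.

```python
import torch.nn as nn

# Sketch of the MNIST MLP classifier quoted above: two hidden layers of
# 400 units each, mapping 28x28 images to 10 digit classes.
mnist_mlp = nn.Sequential(
    nn.Flatten(),               # 28x28 images -> 784-dimensional vectors
    nn.Linear(28 * 28, 400),
    nn.ReLU(),                  # activation choice is an assumption
    nn.Linear(400, 400),
    nn.ReLU(),
    nn.Linear(400, 10),
)
```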
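The Dataset Splits and Experiment Setup entries list the replay-memory sizes and hyper-parameter protocol; collecting them into a single configuration gives roughly the sketch below. Key names are hypothetical; the values are taken from the quotes above.

```python
# Illustrative configuration assembled from the quoted setup (key names are
# hypothetical, values as stated in the entries above).
gmed_config = {
    "memory_size": {
        "split_cifar100": 10_000,
        "split_mini_imagenet": 10_000,
        "default": 500,              # all remaining datasets
    },
    "mnist_examples_per_task": 1_000,
    "gamma": 1.0,                    # decay rate of the editing stride (no decay)
    # alpha (editing stride) and beta (regularization strength) are tuned on
    # the first three tasks only, following the quoted protocol.
}
```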