Gradient-based Editing of Memory Examples for Online Task-free Continual Learning
Authors: Xisen Jin, Arka Sadhu, Junyi Du, Xiang Ren
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments validate the effectiveness of GMED, and our best method significantly outperforms baselines and previous state-of-the-art on five out of six datasets. |
| Researcher Affiliation | Academia | Xisen Jin, Arka Sadhu, Junyi Du, Xiang Ren. University of Southern California. {xisenjin, asadhu, junyidu, xiangren}@usc.edu |
| Pseudocode | Yes | Algorithm 1: Gradient Memory EDiting with ER (ER+GMED) |
| Open Source Code | Yes | Code can be found at https://github.com/INK-USC/GMED. |
| Open Datasets | Yes | We use six public CL datasets in our experiments. Split / Permuted / Rotated MNIST are constructed from the MNIST [21] dataset, which contains images of handwritten digits. We also employ Split CIFAR-10 and Split CIFAR-100, which comprise 5 and 20 disjoint subsets, respectively, based on their class labels. Similarly, Split mini-ImageNet [2] splits the mini-ImageNet [10, 42] dataset into 20 disjoint subsets based on their labels. |
| Dataset Splits | No | For all MNIST experiments, each task consists of 1,000 training examples, following [2]. We set the size of replay memory as 10K for split CIFAR-100 and split mini-ImageNet, and 500 for all remaining datasets. |
| Hardware Specification | No | Computational Efficiency. We analyze the additional forward and backward computation required by ER+GMED and MIR. Compared to ER, ER+GMED adds 3 forward and 1 backward passes to estimate the loss increase, and 1 backward pass to update the example (a hedged code sketch of this editing step follows the table). In comparison, MIR adds 3 forward and 1 backward passes, with 2 of the forward passes computed over a larger set of retrieval candidates. In our experiments, we found GMED has a similar training time cost to MIR. In Appendix B, we report the wall-clock time and observe that the run-time of ER+GMED is 1.5 times that of ER. |
| Software Dependencies | No | For model architectures, we mostly follow the setup of [2]: for the three MNIST datasets, we use an MLP classifier with 2 hidden layers of 400 hidden units each. For the Split CIFAR-10, Split CIFAR-100 and Split mini-ImageNet datasets, we use a ResNet-18 classifier with three times fewer feature maps across all layers. |
| Experiment Setup | Yes | We set the size of replay memory as 10K for split CIFAR-100 and split mini-ImageNet, and 500 for all remaining datasets. Following [8], we tune the hyper-parameters α (editing stride) and β (regularization strength) with only the first three tasks. While γ (decay rate of the editing stride) is a hyper-parameter that may flexibly control the deviation of edited examples from their original states, we find γ = 1.0 (i.e., no decay) leads to better performance in our experiments. |
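
To make the editing step described in the quotes above concrete, below is a minimal PyTorch-style sketch of one ER+GMED memory edit. It is a reconstruction from the pass counts and hyper-parameters quoted in the table, not the authors' reference implementation (see Algorithm 1 in the paper and the released code at https://github.com/INK-USC/GMED for that); the function name `edit_memory_batch`, the deep-copied look-ahead model, and the exact form of the β-regularization term are illustrative assumptions.

```python
# Hedged sketch of one ER+GMED memory-editing step (assumptions noted above).
# Pattern: 3 forward passes + 1 backward pass to estimate the loss increase,
# then 1 backward pass w.r.t. the memory inputs to edit the examples.
import copy

import torch
import torch.nn.functional as F


def edit_memory_batch(model, optimizer, mem_x, mem_y, stream_x, stream_y,
                      alpha=0.1, beta=0.01, gamma=1.0, step=0):
    """Edit a batch of replayed memory examples before the actual ER update."""
    mem_x = mem_x.clone().requires_grad_(True)

    # Forward pass 1: memory-batch loss under the current parameters.
    loss_before = F.cross_entropy(model(mem_x), mem_y)

    # Forward pass 2 + backward pass 1: virtual ("look-ahead") SGD step on the
    # incoming stream batch, applied to a throw-away copy of the model.
    lookahead = copy.deepcopy(model)
    stream_loss = F.cross_entropy(lookahead(stream_x), stream_y)
    grads = torch.autograd.grad(stream_loss, list(lookahead.parameters()))
    lr = optimizer.param_groups[0]["lr"]
    with torch.no_grad():
        for p, g in zip(lookahead.parameters(), grads):
            p.sub_(lr * g)

    # Forward pass 3: memory-batch loss after the virtual update.
    loss_after = F.cross_entropy(lookahead(mem_x), mem_y)

    # Backward pass 2: gradient of the estimated loss increase w.r.t. the
    # memory inputs; the beta term regularizes toward low loss under the
    # current model (assumed form of the regularizer).
    objective = (loss_after - loss_before) - beta * loss_before
    (input_grad,) = torch.autograd.grad(objective, mem_x)

    # Edit: move the examples in the direction that increases the estimated
    # loss increase; the stride alpha is decayed by gamma per edit
    # (gamma = 1.0, the setting reported above, means no decay).
    stride = alpha * (gamma ** step)
    return (mem_x + stride * input_grad).detach()
```

In a training loop, the edited batch returned here would be interleaved with the stream batch for the usual ER update (and, in the paper's setup, written back to the replay memory in place of the originals); the look-ahead copy is discarded after each step, which is consistent with the roughly 1.5x run-time of ER+GMED relative to ER reported above.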