Efficient Lifelong Learning with A-GEM
Authors: Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, Mohamed Elhoseiny
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on several standard lifelong learning benchmarks demonstrate that A-GEM has the best trade-off between accuracy and efficiency. |
| Researcher Affiliation | Collaboration | University of Oxford, Facebook AI Research |
| Pseudocode | Yes | Algorithm 1 Learning and Evaluation Protocols |
| Open Source Code | Yes | The code is available at https://github.com/facebookresearch/agem. |
| Open Datasets | Yes | Permuted MNIST (Kirkpatrick et al., 2016) is a variant of MNIST (LeCun, 1998) dataset of handwritten digits where each task has a certain random permutation of the input pixels which is applied to all the images of that task. Split CIFAR (Zenke et al., 2017) consists of splitting the original CIFAR-100 dataset (Krizhevsky & Hinton, 2009) into 20 disjoint subsets... *(task construction sketched after the table)* |
| Dataset Splits | Yes | As described in Sec. 2 and outlined in Alg. 1, in order to cross validate we use the first 3 tasks, and then report metrics on the remaining 17 tasks after doing a single training pass over each task in sequence. *(protocol sketched after the table)* |
| Hardware Specification | No | The paper mentions "The timing refers to training time on a GPU device" in Table 7, but does not specify the GPU model or any other hardware components such as CPU or memory. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., Python, PyTorch/TensorFlow, CUDA versions). |
| Experiment Setup | Yes | In terms of architectures, we use a fully-connected network with two hidden layers of 256 ReLU units each for Permuted MNIST, a reduced ResNet18 for Split CIFAR like in Lopez-Paz & Ranzato (2017), and a standard ResNet18 (He et al., 2016) for Split CUB and Split AWA. For a given dataset stream, all models use the same architecture, and all models are optimized via stochastic gradient descent with mini-batch size equal to 10. [...] Below we report the hyper-parameters grid considered for different experiments. ... The best setting for each experiment is reported in the parenthesis. *(architecture and training-loop sketch after the table)* |
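
The task streams quoted under "Open Datasets" can be built mechanically from the base datasets. The sketch below is a minimal illustration, assuming flattened MNIST arrays and CIFAR-100 arrays are already loaded; the function names, the random seed, and the 5-classes-per-task split are assumptions for illustration, not the authors' code.

```python
import numpy as np

def make_permuted_mnist_tasks(x, y, num_tasks, seed=0):
    """Permuted MNIST: each task fixes one random pixel permutation
    and applies it to every image of that task.

    x: (N, 784) flattened MNIST images, y: (N,) digit labels.
    """
    rng = np.random.RandomState(seed)
    tasks = []
    for _ in range(num_tasks):
        perm = rng.permutation(x.shape[1])     # one fixed permutation per task
        tasks.append((x[:, perm], y))
    return tasks

def make_split_cifar_tasks(x, y, num_tasks=20, classes_per_task=5):
    """Split CIFAR: partition CIFAR-100 into 20 disjoint class subsets.

    x: (N, 32, 32, 3) images, y: (N,) labels in [0, 100).
    """
    tasks = []
    for t in range(num_tasks):
        cls = np.arange(t * classes_per_task, (t + 1) * classes_per_task)
        idx = np.isin(y, cls)
        tasks.append((x[idx], y[idx]))
    return tasks
```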
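
The learning/evaluation protocol quoted under "Dataset Splits" (cross-validate on the first 3 tasks, then a single training pass over each remaining task) can be sketched as follows. `Model`, `train_single_pass`, and `evaluate` are hypothetical placeholders for the learner and its metrics; only the control flow is grounded in the quoted text.

```python
def run_protocol(tasks, hyperparam_grid, num_cv_tasks=3):
    # 1) Hyper-parameter selection using only the first few tasks.
    best_hp, best_acc = None, -1.0
    for hp in hyperparam_grid:
        model = Model(hp)                      # hypothetical learner
        for task in tasks[:num_cv_tasks]:
            train_single_pass(model, task)     # single pass over the task
        acc = sum(evaluate(model, t) for t in tasks[:num_cv_tasks]) / num_cv_tasks
        if acc > best_acc:
            best_hp, best_acc = hp, acc

    # 2) Train on the remaining tasks in sequence with the chosen setting,
    #    one pass over each, and report metrics on those tasks only.
    model = Model(best_hp)
    eval_tasks = tasks[num_cv_tasks:]
    for task in eval_tasks:
        train_single_pass(model, task)
    return [evaluate(model, t) for t in eval_tasks]
```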
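
For the Permuted MNIST setup quoted under "Experiment Setup" (two hidden layers of 256 ReLU units, SGD, mini-batch size 10), a minimal sketch is given below. It is written in PyTorch purely for illustration, the learning rate is an assumption since the paper's hyper-parameter grid is elided above, and the A-GEM gradient projection itself is not shown.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Fully-connected net with two hidden layers of 256 ReLU units each."""
    def __init__(self, in_dim=784, hidden=256, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # lr is illustrative, not from the paper
loss_fn = nn.CrossEntropyLoss()

def train_single_pass(x, y, batch_size=10):
    """One training pass over a task's data with mini-batches of size 10."""
    for i in range(0, len(x), batch_size):
        xb, yb = x[i:i + batch_size], y[i:i + batch_size]
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()
```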