Meta-Consolidation for Continual Learning

Authors: Joseph K J, Vineeth N Balasubramanian

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments with continual learning benchmarks of MNIST, CIFAR-10, CIFAR-100 and Mini-ImageNet datasets show consistent improvement over five baselines, including a recent state-of-the-art, corroborating the promise of MERLIN.
Researcher Affiliation | Academia | K J Joseph and Vineeth N Balasubramanian, Department of Computer Science and Engineering, Indian Institute of Technology Hyderabad, India. {cs17m18p100001,vineethnb}@iith.ac.in
Pseudocode | Yes | Algorithm 1: MERLIN: Overall Methodology; Algorithm 2: Meta-Consolidation in MERLIN; Algorithm 3: MERLIN Inference
Open Source Code | Yes | Our code¹ is implemented in PyTorch [60] and runs on a single NVIDIA V-100 GPU. (Footnote 1: https://github.com/JosephKJ/merlin)
Open Datasets | Yes | Five standard continual learning benchmarks, viz. Split MNIST [13], Permuted MNIST [88], Split CIFAR-10 [88], Split CIFAR-100 [63] and Split Mini-Imagenet [15], are used in the experiments, following recent continual learning literature [12, 4, 63, 51, 13].
Dataset Splits | No | While Section 3 generally mentions 'training, validation and test samples' for a task, the experimental setup in Section 4.1.1 (Datasets) only details training and test sets for the listed benchmarks (e.g., '1000 images per task for training and the model is evaluated on all test examples'). No specific validation splits, or how they were used, are provided.
Hardware Specification | Yes | Our code¹ is implemented in PyTorch [60] and runs on a single NVIDIA V-100 GPU.
Software Dependencies | No | The paper mentions 'PyTorch [60]' but does not provide a specific version number for PyTorch or any other software dependency.
Experiment Setup | Yes | For the MNIST dataset, we use a two-layer fully connected neural network with 100 neurons each, with ReLU activation... batch size is set to 10 and Adam [35] is used as the optimizer, with an initial learning rate of 0.001 and weight decay of 0.001. ... trained only for a single epoch... We use a chunk size of 300 for all experiments... AdaGrad [20] is used as the optimizer with an initial learning rate of 0.001. Batch size is set to 1 and the VAE network is trained for 25 epochs. At test time, we sample 30 models from the trained decoder.
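
The Open Datasets and Dataset Splits rows quote a Split MNIST-style protocol with a fixed per-task training budget. Below is a minimal sketch of how such task splits could be constructed; it assumes torchvision for data loading, consecutive class pairs per task, and random subsampling to the stated 1000-image budget. None of this is taken from the authors' repository.

```python
# Sketch: build Split MNIST tasks (class pairs) with a per-task training budget.
# Assumptions: torchvision data loaders, consecutive class pairing, random
# subsampling to `images_per_task` -- illustrative only, not the authors' code.
import random
import torch
from torchvision import datasets, transforms


def split_mnist_tasks(root="./data", images_per_task=1000, classes_per_task=2):
    tfm = transforms.ToTensor()
    train = datasets.MNIST(root, train=True, download=True, transform=tfm)
    test = datasets.MNIST(root, train=False, download=True, transform=tfm)

    tasks = []
    for t in range(10 // classes_per_task):
        task_classes = set(range(t * classes_per_task, (t + 1) * classes_per_task))
        train_idx = [i for i, y in enumerate(train.targets.tolist()) if y in task_classes]
        test_idx = [i for i, y in enumerate(test.targets.tolist()) if y in task_classes]
        # Subsample the training pool to the quoted per-task budget;
        # evaluation uses all test examples of the task's classes.
        train_idx = random.sample(train_idx, min(images_per_task, len(train_idx)))
        tasks.append((torch.utils.data.Subset(train, train_idx),
                      torch.utils.data.Subset(test, test_idx)))
    return tasks
```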
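
The Experiment Setup row above quotes the MNIST base-classifier configuration: a two-layer fully connected network with 100 neurons each (interpreted here as two hidden layers of 100 units), ReLU activations, batch size 10, Adam with learning rate 0.001 and weight decay 0.001, trained for a single epoch per task. The sketch below covers only this base-network stage; the VAE/meta-consolidation part of MERLIN (AdaGrad, chunk size 300, 30 sampled models at test time) is not reproduced, and all class and function names are illustrative.

```python
# Sketch of the quoted MNIST base-classifier setup. Names are illustrative;
# this is not the authors' implementation of MERLIN.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


class BaseNet(nn.Module):
    """Two hidden layers of 100 units with ReLU, as quoted for MNIST."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 100), nn.ReLU(),
            nn.Linear(100, 100), nn.ReLU(),
            nn.Linear(100, num_classes),
        )

    def forward(self, x):
        return self.net(x)


def train_single_epoch(model, train_set, device="cpu"):
    # Batch size 10, Adam with lr=0.001 and weight decay=0.001, one epoch,
    # following the hyperparameters quoted in the table.
    loader = DataLoader(train_set, batch_size=10, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model
```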