Continual learning with the neural tangent ensemble

Authors: Ari Benjamin, Christian-Gernot Pehle, Kyle Daruwalla

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify these predictions on the Permuted MNIST task with MLPs and on the task-incremental CIFAR100 with modern CNN architectures. In the Permuted MNIST task, an MLP with 10 output units is tasked with repeatedly classifying MNIST, but in each task the pixels are shuffled with a new static permutation. In task-incremental CIFAR100, a convolutional net with 100 output units sees only 10 classes each task. (A task-construction sketch follows the table.)
Researcher Affiliation | Academia | Ari S. Benjamin, Christian Pehle, Kyle Daruwalla; Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724; {benjami,pehle,daruwal}@cshl.edu
Pseudocode | Yes | Pseudocode for the resulting algorithm is in the Appendix. We also display the result of sweeps over β and η on the Permuted MNIST task in Fig. 7. (Appendix 8.3, Algorithm 1: Neural Tangent Ensemble posterior update rule with current gradients; a linearization sketch follows the table.)
Open Source Code | Yes | The code for all figures in this paper was written in Jax and is available at https://github.com/ZadorLaboratory/NeuralTangentEnsemble.
Open Datasets | Yes | Below, we verify these predictions on the Permuted MNIST task with MLPs and on the task-incremental CIFAR100 with modern CNN architectures.
Dataset Splits | No | The paper mentions 'test accuracy' and 'test set' for evaluation but does not explicitly detail the train/validation/test dataset splits (e.g., percentages or sample counts) needed to reproduce the data partitioning.
Hardware Specification | Yes | All MNIST experiments were completed on two NVIDIA RTX 6000 cards, and all CIFAR100 experiments were conducted on NVIDIA H100 cards.
Software Dependencies | No | The paper states the code was written in Jax but does not provide specific version numbers for Jax or any other software dependencies.
Experiment Setup | Yes | A single MLP was trained with 1,000 hidden units per layer and 2 hidden layers using ReLU nonlinearities. The batch size was 24 and the parameters of the NTE algorithm were η = 0.01 and β = 1. (And) We used SGD with batch size 128, learning rate 0.01, and momentum swept from 0 to 1. (A configuration sketch follows the table.)
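
The Research Type and Open Datasets rows describe the Permuted MNIST protocol: each task is ordinary MNIST classification, but the pixels are shuffled by a new static permutation drawn once per task. Below is a minimal sketch of that task construction in JAX; the number of tasks, the seed, and the array names are illustrative assumptions, not taken from the paper's code.

    # Illustrative sketch of the Permuted MNIST protocol quoted above.
    # `train_images` is assumed to be an (N, 784) array of flattened digits;
    # the task count and seed are arbitrary choices, not the paper's settings.
    import jax

    def make_permutations(key, num_tasks, num_pixels=784):
        """Draw one fixed, static pixel permutation per task."""
        keys = jax.random.split(key, num_tasks)
        return [jax.random.permutation(k, num_pixels) for k in keys]

    def permute_task(images, perm):
        """Apply a task's permutation to every flattened image."""
        return images[:, perm]

    key = jax.random.PRNGKey(0)
    perms = make_permutations(key, num_tasks=10)
    print(perms[0][:8])  # first few indices of task 0's permutation
    # Task t presents the same labels but shuffled pixels:
    # task_images = permute_task(train_images, perms[t])

Reusing the same perms[t] for every batch of task t is what keeps the permutation static within a task while differing across tasks.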
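The Pseudocode row names Algorithm 1, a "Neural Tangent Ensemble posterior update rule with current gradients"; that algorithm lives in the paper's Appendix 8.3 and is not reproduced here. As background only, the sketch below shows the first-order (neural tangent) expansion of a network around reference parameters, computed with jax.jvp, which is the linearization the method's name refers to; the toy network, shapes, and perturbation scale are assumptions for illustration, not the authors' update rule.

    # First-order (neural tangent) expansion around reference parameters theta0:
    # f(x, theta) ~= f(x, theta0) + J(x, theta0) @ (theta - theta0).
    # Background illustration only; this is not the paper's Algorithm 1.
    import jax
    import jax.numpy as jnp

    def net(params, x):
        # Tiny stand-in network; the paper's MLPs/CNNs would slot in here.
        w1, w2 = params
        return jnp.tanh(x @ w1) @ w2

    def linearized(params0, delta, x):
        """Evaluate the tangent (first-order Taylor) model at params0 + delta."""
        f0, jvp_out = jax.jvp(lambda p: net(p, x), (params0,), (delta,))
        return f0 + jvp_out

    key = jax.random.PRNGKey(0)
    k1, k2 = jax.random.split(key)
    params0 = (jax.random.normal(k1, (784, 100)), jax.random.normal(k2, (100, 10)))
    delta = jax.tree_util.tree_map(lambda p: 0.01 * jnp.ones_like(p), params0)
    x = jnp.ones((1, 784))
    print(linearized(params0, delta, x).shape)  # (1, 10)

jax.jvp evaluates the Jacobian-vector product without materializing the full Jacobian, which keeps the linearization cheap even for large parameter counts.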
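The Experiment Setup row pins down the Permuted MNIST architecture: an MLP with 2 hidden layers of 1,000 ReLU units each, 10 outputs, batch size 24, and NTE hyperparameters η = 0.01 and β = 1. A minimal JAX sketch of that architecture follows; the initialization scale and function names are illustrative assumptions, and the NTE update rule itself is not implemented here.

    # Minimal MLP matching the reported Permuted MNIST setup:
    # 784 inputs -> 1,000 ReLU -> 1,000 ReLU -> 10 outputs.
    # Initialization scale and naming are illustrative assumptions.
    import jax
    import jax.numpy as jnp

    LAYER_SIZES = [784, 1000, 1000, 10]
    BATCH_SIZE = 24          # reported batch size for the MLP experiments
    ETA, BETA = 0.01, 1.0    # reported NTE hyperparameters

    def init_mlp(key, sizes=LAYER_SIZES):
        params = []
        for d_in, d_out in zip(sizes[:-1], sizes[1:]):
            key, wk = jax.random.split(key)
            params.append({
                "w": jax.random.normal(wk, (d_in, d_out)) / jnp.sqrt(d_in),
                "b": jnp.zeros(d_out),
            })
        return params

    def mlp(params, x):
        for layer in params[:-1]:
            x = jax.nn.relu(x @ layer["w"] + layer["b"])
        return x @ params[-1]["w"] + params[-1]["b"]   # logits for 10 classes

    params = init_mlp(jax.random.PRNGKey(0))
    logits = mlp(params, jnp.ones((BATCH_SIZE, 784)))
    print(logits.shape)  # (24, 10)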