Information-theoretic Online Memory Selection for Continual Learning

Authors: Shengyang Sun, Daniele Calandriello, Huiyi Hu, Ang Li, Michalis Titsias

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate that the proposed information-theoretic criteria encourage to select representative memories for learning the underlying function. We also conduct standard continual learning benchmarks and demonstrate the advantage of our proposed reservoir sampler over strong GCL baselines at various levels of data imbalance.
Researcher Affiliation | Collaboration | Shengyang Sun¹, Daniele Calandriello⁴, Huiyi Hu², Ang Li³, Michalis K. Titsias⁴; ¹University of Toronto, ¹Vector Institute, ²Google Brain, ³Baidu Apollo, ⁴DeepMind
Pseudocode | Yes | Algorithm 1 Information-theoretic Reservoir Sampling (InfoRS)... Algorithm 2 Information-theoretic Greedy Selection (InfoGS)... Algorithm 3 Reservoir Sampling (Vitter, 1985)... Algorithm 4 Weighted Reservoir Sampling (Chao, 1982; Efraimidis & Spirakis, 2006)... Algorithm 5 Class-Balanced Reservoir Sampling (Chrysakis & Moens, 2020). (A sketch of the classical reservoir sampling routine cited in Algorithm 3 follows the table.)
Open Source Code | No | The paper includes a Reproducibility Statement that details pseudocode and hyper-parameters, but it does not provide any statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | The benchmarks involve Permuted MNIST, Split MNIST, Split CIFAR10, and Split MiniImageNet.
Dataset Splits | Yes | To tune the hyper-parameters, we pick 10% of training data as the validation set, then we pick the best hyper-parameter based on the averaged validation accuracy over 5 random seeds.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models or configurations) used for running the experiments; only general model architectures are mentioned.
Software Dependencies | No | The paper mentions software components like 'stochastic gradient descent optimizer' and refers to a model implementation (DER++), but it does not provide specific version numbers for any software, libraries, or frameworks used (e.g., 'Python 3.8, PyTorch 1.9, and CUDA 11.1').
Experiment Setup | Yes | To tune the hyper-parameters, we pick 10% of training data as the validation set, then we pick the best hyper-parameter based on the averaged validation accuracy over 5 random seeds. The tuning hyper-parameters include the learning rate lr, the logit regularization coefficient α, the target regularization coefficient β, the learnability ratio η, and the information thresholding ratio γi, if needed. We present the detailed hyper-parameters in Table 1. (An illustrative sketch of this tuning protocol follows the table.)
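
For context on the pseudocode row above, here is a minimal Python sketch of the classical reservoir sampling algorithm (Vitter, 1985) that Algorithm 3 references. This is only the uniform baseline, not the paper's InfoRS, which additionally gates insertion with information-theoretic criteria; the function name and signature are illustrative.

```python
import random

def reservoir_sample(stream, capacity, seed=0):
    """Classical reservoir sampling (Vitter, 1985): keep a uniform random
    subset of `capacity` items from a stream of unknown length."""
    rng = random.Random(seed)
    buffer = []
    for t, item in enumerate(stream):
        if len(buffer) < capacity:
            buffer.append(item)      # fill the buffer until it reaches capacity
        else:
            j = rng.randint(0, t)    # uniform index over all t + 1 items seen so far
            if j < capacity:         # item is stored with probability capacity / (t + 1)
                buffer[j] = item     # evict a uniformly chosen stored item
    return buffer

# Example: maintain a 100-point memory over a stream of 10,000 observations.
memory = reservoir_sample(range(10_000), capacity=100)
```
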
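The dataset-split and experiment-setup rows describe the same tuning protocol: hold out 10% of the training data for validation and choose the hyper-parameters with the best validation accuracy averaged over 5 random seeds. The sketch below is an assumption-laden illustration of that loop; `train_and_eval` is a hypothetical callback standing in for training a continual learner and returning its validation accuracy, and the grid of configurations is left unspecified.

```python
def tune_hyperparameters(configs, train_set, train_and_eval,
                         val_fraction=0.1, num_seeds=5):
    """Pick the config with the best validation accuracy averaged over seeds.

    `configs` is an iterable of hyper-parameter settings (e.g. dicts with
    lr, alpha, beta, eta, gamma); `train_and_eval(config, train, val, seed)`
    is a hypothetical callback that returns a validation accuracy.
    """
    n_val = int(len(train_set) * val_fraction)   # 10% of training data held out
    val_split, train_split = train_set[:n_val], train_set[n_val:]

    best_config, best_score = None, float("-inf")
    for config in configs:
        accs = [train_and_eval(config, train_split, val_split, seed)
                for seed in range(num_seeds)]    # 5 random seeds
        mean_acc = sum(accs) / len(accs)         # averaged validation accuracy
        if mean_acc > best_score:
            best_config, best_score = config, mean_acc
    return best_config
```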