Overcoming Catastrophic Forgetting with Hard Attention to the Task

Authors: Joan Serrà, Dídac Surís, Marius Miron, Alexandros Karatzoglou

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate HAT in the context of image classification, using what we believe is a high-standard evaluation protocol: we consider random sequences of 8 publicly-available data sets representing different tasks, and compare with a dozen of recent competitive approaches. We show favorable results in 4 different experimental setups, cutting current rates by 45 to 80%. We now move to the HAT results. First of all, we observe that HAT consistently performs better than all considered baselines for all t ≥ 2 (Fig. 3). For the case of t = 2, it obtains an average forgetting ratio ρ^{≤2} = 0.02, while the best baseline is EWC with ρ^{≤2} = 0.08 (Table 1). For the case of t = 8, HAT obtains ρ^{≤8} = 0.06, while the best baseline is PNN with ρ^{≤8} = 0.11. This implies a reduction in forgetting of 75% for t = 2 and 45% for t = 8. (See the arithmetic check after the table.)
Researcher Affiliation | Collaboration | 1 Telefónica Research, Barcelona, Spain; 2 Universitat Politècnica de Catalunya, Barcelona, Spain; 3 Universitat Pompeu Fabra, Barcelona, Spain. Correspondence to: Joan Serrà <joan.serra@telefonica.com>
Pseudocode | No | The paper describes its methods in text and using mathematical equations (e.g., Eq. 1, Eq. 2, Eq. 3) and a schematic diagram (Fig. 1), but does not include any explicit pseudocode or algorithm blocks. (A hedged sketch of the scaled-sigmoid task mask follows the table.)
Open Source Code | Yes | We make our code publicly-available: https://github.com/joansj/hat
Open Datasets | Yes | The considered data sets are: CIFAR10 and CIFAR100 (Krizhevsky, 2009), FaceScrub (Ng & Winkler, 2014), FashionMNIST (Xiao et al., 2017), NotMNIST (Bulatov, 2011), MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), and TrafficSigns (Stallkamp et al., 2011). For further details on data we refer to Supplementary Materials. (See the loading sketch after the table.)
Dataset Splits | Yes | For each task, we randomly split 15% of the training set and keep it for validation purposes. We repeat 10 times this sequential train/test procedure with 10 different seed numbers, which are also used in the rest of randomizations and initializations (see below). (See the seeded split sketch after the table.)
Hardware Specification | No | The paper describes the network architecture (AlexNet-like) but does not specify any hardware details (e.g., GPU/CPU models, memory, or cloud resources) used for running the experiments.
Software Dependencies | Yes | Unless stated otherwise, our code uses PyTorch's defaults for version 0.2.0 (Paszke et al., 2017).
Experiment Setup | Yes | Unless stated otherwise, we employ an AlexNet-like architecture... We use rectified linear units as activations, and 2x2 max-pooling after the convolutional layers. We also use a dropout of 0.2 for the first two layers and of 0.5 for the rest. A fully-connected layer with a softmax output is used as a final layer, together with categorical cross entropy loss. All layers are randomly initialized with Xavier uniform initialization... We train all models with backpropagation and plain SGD, using a learning rate of 0.05, and decaying it by a factor of 3... We stop training when we reach a learning rate lower than 10^-4 or we have iterated over 200 epochs... Batch size is set to 64. Unless stated otherwise, we use smax = 400 and c = 0.75. (A sketch of this training schedule follows the table.)
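
The 75% and 45% figures in the Research Type row follow directly from the quoted forgetting ratios; the short check below just reproduces that arithmetic (the function and variable names are ours, not the paper's).

```python
# Relative reduction in forgetting, computed from the ratios quoted above
# (HAT vs. the best baseline at t = 2 and t = 8).
def relative_reduction(hat_rho: float, baseline_rho: float) -> float:
    """Fraction by which HAT shrinks the baseline's forgetting-ratio magnitude."""
    return 1.0 - abs(hat_rho) / abs(baseline_rho)

print(f"t = 2: {relative_reduction(0.02, 0.08):.0%} less forgetting than EWC")  # 75%
print(f"t = 8: {relative_reduction(0.06, 0.11):.0%} less forgetting than PNN")  # ~45%
```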
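
Because the method is given only as equations and a diagram (Pseudocode row), here is a minimal sketch of how we read the core mechanism: each layer holds a learnable per-task embedding, and a sigmoid scaled by a factor s (annealed toward smax = 400 during training, as quoted in the Experiment Setup row) turns it into an almost-binary mask over the layer's units. The class name, layer sizes, and annealing comment are our assumptions; the authors' repository linked above is the authoritative implementation.

```python
import torch
import torch.nn as nn

class HardAttentionLayer(nn.Module):
    """Linear layer gated by a task-conditioned, scaled-sigmoid mask (sketch, not the authors' code)."""

    def __init__(self, n_tasks: int, in_features: int, out_features: int):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)
        # One learnable embedding per task; its sigmoid becomes the unit mask.
        self.task_embedding = nn.Embedding(n_tasks, out_features)

    def forward(self, x: torch.Tensor, task_id: int, s: float) -> torch.Tensor:
        # Scaled sigmoid: as s grows toward s_max, the mask approaches {0, 1}.
        mask = torch.sigmoid(s * self.task_embedding.weight[task_id])
        return torch.relu(self.fc(x)) * mask

# During training s would be annealed within each epoch; at test time s = s_max.
s_max = 400.0  # value quoted in the Experiment Setup row
layer = HardAttentionLayer(n_tasks=8, in_features=32, out_features=16)
out = layer(torch.randn(4, 32), task_id=0, s=s_max)
```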
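
Most of the eight data sets listed in the Open Datasets row ship with torchvision; a hedged loading sketch is below (the download paths are ours). FaceScrub, NotMNIST, and TrafficSigns are not part of torchvision, so the paper's Supplementary Materials and the linked repository are the reference for how those were obtained and preprocessed.

```python
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# The torchvision-hosted subset of the eight tasks.
cifar10  = datasets.CIFAR10("data/cifar10", train=True, download=True, transform=to_tensor)
cifar100 = datasets.CIFAR100("data/cifar100", train=True, download=True, transform=to_tensor)
mnist    = datasets.MNIST("data/mnist", train=True, download=True, transform=to_tensor)
fashion  = datasets.FashionMNIST("data/fashion", train=True, download=True, transform=to_tensor)
svhn     = datasets.SVHN("data/svhn", split="train", download=True, transform=to_tensor)
```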
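
The Dataset Splits row (15% of each training set held out for validation, repeated over 10 seeds) maps onto a seeded split like the one below. Note this uses the current torch.utils.data.random_split signature rather than the PyTorch 0.2.0 release the paper targeted, and the function name and dummy data are ours.

```python
import torch
from torch.utils.data import Dataset, TensorDataset, random_split

def split_train_val(dataset: Dataset, seed: int, val_fraction: float = 0.15):
    """Hold out val_fraction of a task's training set, reproducibly for a given seed."""
    n_val = int(len(dataset) * val_fraction)
    n_train = len(dataset) - n_val
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_val], generator=generator)

# Dummy stand-in for one task's training set; the paper repeats the whole
# sequential train/test procedure for 10 different seeds.
dummy_task = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))
splits = [split_train_val(dummy_task, seed) for seed in range(10)]
```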
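
The Experiment Setup row quotes the optimization protocol but elides the exact layer sizes, so the sketch below only fills in what is quoted: plain SGD at learning rate 0.05, decay by a factor of 3 (here tied to a hypothetical validation-patience criterion, since the quote truncates the trigger), stopping once the learning rate falls below 10^-4 or 200 epochs have passed, and batch size 64. The model, data, patience value, and validation stub are placeholders, not the authors' setup.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data: the paper's AlexNet-like layer sizes are elided above.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
                      nn.Dropout(0.5), nn.Linear(256, 10))
loader = DataLoader(TensorDataset(torch.randn(512, 3, 32, 32),
                                  torch.randint(0, 10, (512,))),
                    batch_size=64, shuffle=True)            # batch size 64, as quoted

criterion = nn.CrossEntropyLoss()                           # categorical cross-entropy, as quoted
lr, patience, bad_epochs, best_val = 0.05, 5, 0, float("inf")

for epoch in range(200):                                    # hard cap of 200 epochs, as quoted
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # plain SGD at the current rate
    for x, y in loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()

    val_loss = 0.0  # placeholder: evaluate on the held-out 15% validation split here
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:   # assumed trigger for the quoted decay by a factor of 3
        lr, bad_epochs = lr / 3.0, 0
    if lr < 1e-4:                # stopping criterion quoted above
        break
```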