Fortuitous Forgetting in Connectionist Networks
Authors: Hattie Zhou, Ankit Vani, Hugo Larochelle, Aaron Courville
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate the power of this perspective by showing that we can significantly improve upon existing work in their respective settings through the use of more targeted forgetting operations. Our analysis in Section 5 sheds light on the mechanism through which iterative training leads to parameter values with better generalization properties. |
| Researcher Affiliation | Collaboration | Hattie Zhou (Mila, Université de Montréal); Ankit Vani (Mila, Université de Montréal); Hugo Larochelle (Mila, CIFAR Fellow, Google Research, Brain Team); Aaron Courville (Mila, Université de Montréal, CIFAR Fellow) |
| Pseudocode | No | The paper describes methods and processes in detail, including mathematical formulations and textual descriptions of algorithms like IMP or KE, but it does not provide any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | We make our code available at https://github.com/hlml/fortuitous_forgetting. |
| Open Datasets | Yes | Flower (Nilsback & Zisserman, 2008): 102 classes, 1020 train / 1020 val / 6149 test, 8189 total; CUB (Wah et al., 2011): 200 classes, 5994 train / N/A val / 5794 test, 11788 total; Aircraft (Maji et al., 2013): 100 classes, 3334 train / 3333 val / 3333 test, 10000 total; MIT67 (Quattoni & Torralba, 2009): 67 classes, 5360 train / N/A val / 1340 test, 6700 total; Stanford-Dogs (Khosla et al., 2011): 120 classes, 12000 train / N/A val / 8580 test, 20580 total |
| Dataset Splits | Yes | Table A6: Summary of the five datasets used in Section 4.1, adopted from Taha et al. (2021), reports classes and train/val/test split sizes; e.g., Flower (Nilsback & Zisserman, 2008): 102 classes, 1020 train / 1020 val / 6149 test, 8189 total |
| Hardware Specification | No | The paper provides training details such as the optimizer, momentum, weight decay, learning-rate schedule, batch size, and number of epochs ('All networks are trained with stochastic gradient descent (SGD) with momentum of 0.9 and weight decay of 10^-4. We also use a cosine learning rate schedule (Loshchilov & Hutter, 2017). ... Models are trained with a batch size of 32 for 200 epochs each generation.'), but it does not specify any hardware components such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer (Kingma & Ba, 2014)' and implicitly uses deep learning frameworks (likely PyTorch, given the community and affiliations), but it does not specify any version numbers for these software dependencies (e.g., 'PyTorch 1.x', 'Python 3.x'). |
| Experiment Setup | Yes | All networks are trained with stochastic gradient descent (SGD) with momentum of 0.9 and weight decay of 10^-4. We also use a cosine learning rate schedule (Loshchilov & Hutter, 2017). Taha et al. (2021) use an initial learning rate of 0.256, but we find that a smaller learning rate than the one used in Taha et al. (2021) is beneficial for certain datasets, so we consider a learning rate in {0.1, 0.256} for all experiments and report the setting with the better validation performance. Models are trained with a batch size of 32 for 200 epochs each generation. |
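The Experiment Setup row above fully specifies the optimizer and schedule, so a minimal sketch of that configuration follows. PyTorch is an assumption (the paper does not name its framework; the Software Dependencies row only guesses it is likely), and `model`, `train_loader`, and `loss_fn` are hypothetical placeholders, not identifiers from the released code.

```python
# Sketch of the quoted training setup: SGD (momentum 0.9, weight decay 10^-4),
# cosine learning-rate schedule, batch size 32, 200 epochs per generation,
# with the initial learning rate chosen from {0.1, 0.256} by validation accuracy.
import torch

EPOCHS_PER_GENERATION = 200        # "200 epochs each generation"
BATCH_SIZE = 32                    # "batch size of 32"
LEARNING_RATES = (0.1, 0.256)      # grid reported in the paper

def train_one_generation(model, train_loader, loss_fn, lr):
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=lr,
        momentum=0.9,              # "momentum of 0.9"
        weight_decay=1e-4,         # "weight decay of 10^-4"
    )
    # Cosine schedule (Loshchilov & Hutter, 2017), stepped once per epoch.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=EPOCHS_PER_GENERATION
    )
    for _ in range(EPOCHS_PER_GENERATION):
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```

In this reading, each "generation" of the iterative-training procedure would rerun this loop, with the learning rate in `LEARNING_RATES` selected per dataset by validation performance as the quoted setup describes.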