Lifelong Learning of Compositional Structures
Authors: Jorge A. Mendez, Eric Eaton
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated our framework with the three compositional structures of Section 4.1. All methods assimilate task T(t) via backpropagation on the structure's parameters ψ(t). For each, we trained three instantiations of Algorithm 1, varying the method used for adaptation. Naïve fine-tuning (NFT) updates components via standard backpropagation, ignoring past tasks. Elastic weight consolidation (EWC; Kirkpatrick et al., 2017) penalizes modifying model parameters via $\frac{\lambda}{2}\sum_{t=1}^{T-1}\|\theta - \theta^{(t)}\|^2_{F^{(t)}}$, where F(t) is the Fisher information around θ(t); backpropagation is carried out on the regularized loss, and we approximated F(t) with Kronecker factors (a diagonal-Fisher sketch of this penalty appears after the table). Experience replay (ER) stores n_m samples per task in a replay buffer, and during adaptation takes backpropagation steps with data from the replay buffer along with the current task's data. We explored variations with and without the expansion step: dynamic + compositional methods use component dropout to add new components, while compositional methods keep a fixed-size set. Baselines: for every adaptation method listed above, we constructed two baselines. Joint baselines use compositional structures, but do not separate assimilation and accommodation, and instead update components and structures jointly. In contrast, no-components baselines optimize a single architecture to be used for all tasks, with additional task-specific input and output mappings E(t) and D(t). The latter baselines correspond to the most common lifelong learning approach, which learns a monolithic structure shared across tasks, while the former are the naïve extensions of those methods to a compositional setting. We also trained an ablated version of our framework that keeps all components fixed after initialization (FM), only taking assimilation steps for each new task. We repeated experiments ten times with varying random seeds. Additional results, beyond those presented in this section, are given in Appendix F. Table 2 summarizes the results obtained with linear models on the three linear data sets described under Open Datasets below. The compositional versions of ER, EWC, and NFT clearly outperformed all the joint versions, which learn the same form of models but by jointly optimizing structures and components. This suggests that the separation of the learning process into assimilation and accommodation stages enables the agent to better capture the structure of the problem. Interestingly, the no-components variants, which learn a single linear model for all tasks, performed better than the jointly trained versions in two out of the three data sets, and even outperformed our compositional algorithms in one. This indicates that the tasks in those two data sets (Landmine and Schools) are so closely related that a single model can capture them. |
| Researcher Affiliation | Academia | Jorge A. Mendez and Eric Eaton, Department of Computer and Information Science, University of Pennsylvania, {mendezme,eeaton}@seas.upenn.edu |
| Pseudocode | Yes | Algorithm 1 Lifelong Comp. Learning |
| Open Source Code | Yes | Code and data sets are available at https://github.com/Lifelong-ML/Mendez2020Compositional.git. |
| Open Datasets | Yes | We evaluated these methods on tasks with no evident compositional structure to demonstrate that there is no strict requirement for a certain type of compositionality. Appendix D introduces a simple compositional data set, and shows that our results naturally extend to that setting. For details on data sets and hyper-parameters, see Appendix E. Code and data sets are available at https://github.com/Lifelong-ML/Mendez2020Compositional.git. 5.2.1 LINEAR COMBINATIONS OF MODELS: We first evaluated linear combinations of models on three data sets used previously for evaluating linear lifelong learning (Ruvolo & Eaton, 2013). The Facial Recognition (FERA) data set tasks involve recognizing one of three facial expression action units for one of seven people, for a total of T = 21 tasks. The Landmine data set consists of T = 29 tasks, which require detecting land mines in radar images from different regions. Finally, the London Schools (Schools) data set contains T = 139 regression tasks, each corresponding to exam score prediction in a different school. 5.2.2 DEEP COMPOSITIONAL LEARNING WITH SOFT LAYER ORDERING: We then evaluated how the different algorithms performed when learning deep nets with soft layer ordering (a minimal sketch appears after the table), using five data sets. Binary MNIST (MNIST) is a common lifelong learning benchmark, where each task is a binary classification problem between a pair of digits. We constructed T = 10 tasks by randomly sampling the digits with replacement across tasks. The Binary Fashion MNIST (Fashion) data set is similar to MNIST, but images correspond to items of clothing. For these two data sets, all models used a task-specific input transformation layer E(t) initialized at random and kept fixed throughout training, to ensure that the input spaces were sufficiently different (Meyerson & Miikkulainen, 2018). A more complex lifelong learning problem commonly used in the literature is Split CUB-200 (CUB), where the agent must classify bird species. We created T = 20 tasks by randomly sampling ten species for each, without replacement across tasks. All agents used a frozen ResNet-18 pre-trained on ImageNet as a feature extractor E(t) shared across all tasks. For these first three data sets, all architectures were fully connected networks. To show that our framework supports more complex convolutional architectures, we used two additional data sets. We constructed a lifelong learning version of CIFAR-100 (CIFAR) with T = 20 tasks by randomly sampling five classes per task, without replacement across tasks. Finally, we used the Omniglot data set, which consists of T = 50 multi-class classification problems, each corresponding to detecting handwritten symbols in a given alphabet. |
| Dataset Splits | Yes | For these four data sets [MNIST, Fashion, CUB, and CIFAR], we used the standard train/test split, and further divided the training set into 80% for training and 20% for validation. Finally, for Omniglot, we used each alphabet as one task, and split the data into 80% for training, 10% for validation, and 10% for test, for each task. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used, such as GPU models, CPU specifications, or cloud computing instances. |
| Software Dependencies | No | The paper mentions PyTorch as providing a pre-trained model but does not specify its version number or any other software dependencies with version information. |
| Experiment Setup | Yes | All agents trained for 100 epochs on each task, with a mini-batch of 32 samples. Compositional agents used the first 99 epochs solely for assimilation and the last epoch for adaptation (a sketch of this schedule appears after the table). Dynamic + compositional agents followed this same process, but every assimilation step was done via component dropout; after the adaptation step, the agent kept the new component if its validation performance with the added component represented at least a 5% relative improvement over the performance without the additional component. Joint agents trained all components and the structure for the current task jointly during all 100 epochs, keeping the structure for the previous tasks fixed, while no-components agents trained the whole model at every epoch. ER-based algorithms used a replay buffer of a single mini-batch per task. Similarly, EWC-based algorithms used a single mini-batch to compute the approximate Fisher information matrix required for regularization, and used a fixed regularization parameter λ = 10⁻³. |
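
For concreteness, here is a minimal sketch of the EWC penalty quoted in the Research Type row, using a diagonal Fisher approximation in place of the Kronecker factorization the paper reports; the `model`, `old_params`, and `fishers` names are illustrative, not from the authors' code.

```python
import torch

def ewc_penalty(model, old_params, fishers, lam=1e-3):
    """(lam / 2) * sum_t || theta - theta^(t) ||^2_{F^(t)}.

    old_params[t] and fishers[t] map parameter names to detached copies of
    the parameters theta^(t) and diagonal Fisher estimates F^(t) stored
    after task t. lam defaults to the lambda = 10^-3 quoted above.
    """
    penalty = torch.zeros(())
    for theta_t, fisher_t in zip(old_params, fishers):
        for name, p in model.named_parameters():
            diff = p - theta_t[name]
            penalty = penalty + (fisher_t[name] * diff.pow(2)).sum()
    return 0.5 * lam * penalty
```

During EWC-based adaptation, this term would simply be added to the current task loss before calling `backward()`.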
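The soft layer ordering architecture referenced in the Open Datasets row (Meyerson & Miikkulainen, 2018) can be sketched as follows: k shared components are soft-combined at each of k depths by task-specific structure weights ψ(t). This is a single-task sketch under assumed dimensions; class and attribute names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftOrderingNet(nn.Module):
    """k shared linear components, soft-combined at each of k depths
    by task-specific structure weights psi (one task shown)."""

    def __init__(self, hidden_dim=64, k=4):
        super().__init__()
        self.components = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(k)])
        # psi holds one column of component logits per depth.
        self.psi = nn.Parameter(torch.zeros(k, k))

    def forward(self, h):
        s = F.softmax(self.psi, dim=0)  # component weights at each depth
        for d in range(len(self.components)):
            # Weighted sum of all component outputs at depth d.
            h = sum(s[i, d] * torch.relu(m(h))
                    for i, m in enumerate(self.components))
        return h
```

During assimilation, only `psi` (and any task-specific input/output mappings) would receive gradient updates; the shared `components` are touched only during adaptation.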
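Finally, the 100-epoch assimilation/adaptation schedule in the Experiment Setup row might be driven by a loop like the one below. `opt_structure`, `opt_components`, and `model.loss` are assumed names, and the ER-style replay (one stored mini-batch per task) stands in for the paper's other adaptation options.

```python
def train_task(model, task_loader, replay_buffers,
               opt_structure, opt_components, epochs=100):
    # Assimilation: the first 99 epochs update only the task-specific
    # structure parameters psi^(t); shared components stay frozen.
    for _ in range(epochs - 1):
        for x, y in task_loader:
            opt_structure.zero_grad()
            model.loss(x, y).backward()
            opt_structure.step()

    # Adaptation (ER variant): one final epoch updates the shared
    # components on current-task data plus one replayed mini-batch
    # per previous task.
    for x, y in task_loader:
        loss = model.loss(x, y)
        for x_old, y_old in replay_buffers:
            loss = loss + model.loss(x_old, y_old)
        opt_components.zero_grad()
        loss.backward()
        opt_components.step()
```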