Putting An End to End-to-End: Gradient-Isolated Learning of Representations
Authors: Sindy Löwe, Peter O'Connor, Bastiaan Veeling
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the applicability of the GIM approach to the visual and audio domains. As shown in Table 1, Greedy InfoMax (GIM) outperforms its end-to-end trained CPC counterpart, despite its unsupervised features being optimized greedily without any backpropagation between modules. (A minimal sketch of this gradient isolation follows the table.) |
| Researcher Affiliation | Academia | AMLab, University of Amsterdam |
| Pseudocode | No | The paper describes the approach using text and diagrams (Figure 1) but does not provide a formal pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/loeweX/Greedy_InfoMax. |
| Open Datasets | Yes | We focus on the STL-10 dataset [Coates et al., 2011] which provides an additional unlabeled training dataset. We follow the setup of Oord et al. [2018] unless specified otherwise and use a 100-hour subset of the publicly available LibriSpeech dataset [Panayotov et al., 2015]. |
| Dataset Splits | Yes | The training curves of the two models as shown in Figure 3 provide some insight into this decreased performance. The learning curves of the first module (Figure 3a) reflect that there is no difference in its training in the two models. Modules two and three (Figures 3b and 3c), however, reveal a crucial difference. The iteratively trained modules show a larger divergence between the training and validation loss, indicating stronger overfitting. |
| Hardware Specification | No | The paper mentions 'GPU memory consumption' and 'GPU memory' in Table 2, but does not specify any concrete GPU models, CPU types, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions general software like 'Kaldi toolkit' but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | For data augmentation, we take random 64×64 crops from the 96×96 images, flip horizontally with probability 0.5 and convert to grayscale. We divide each image of 64×64 pixels into a total of 7×7 local patches, each of size 16×16 with 8 pixels overlap. The patches are encoded by a ResNet-50 v2 model [He et al., 2016] without batch normalization [Ioffe and Szegedy, 2015]. We split the model into three gradient-isolated modules that we train in sync and with a constant learning rate. Remaining implementation details are presented in Appendix A.1. (See the augmentation-and-patching sketch below the table.) |
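
The gradient isolation quoted in the Research Type row is easy to misread, so here is a minimal PyTorch-style sketch of greedy, module-local training. The module architectures, the `local_loss` placeholder, and the learning rate are illustrative assumptions, not the paper's actual setup: the paper's modules are ResNet-50 v2 blocks trained with an InfoNCE-style objective.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the paper's encoder modules (the paper uses
# ResNet-50 v2 blocks; tiny conv stacks keep this sketch self-contained).
modules = nn.ModuleList([
    nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU()),
])
# One optimizer per module: the modules are trained in sync but independently,
# with a constant learning rate (the value here is an assumption).
optimizers = [torch.optim.Adam(m.parameters(), lr=1.5e-4) for m in modules]

def local_loss(z):
    # Placeholder for the per-module InfoNCE-style loss used in the paper;
    # a dummy scalar keeps the sketch runnable end to end.
    return z.pow(2).mean()

x = torch.randn(8, 1, 64, 64)  # a batch of grayscale 64x64 crops
h = x
for module, opt in zip(modules, optimizers):
    z = module(h)
    loss = local_loss(z)  # greedy, module-local objective
    opt.zero_grad()
    loss.backward()       # gradients stay inside this module
    opt.step()
    h = z.detach()        # gradient isolation: no backprop to earlier modules
```

The `detach()` on each module's output is the whole trick: each module still consumes the representation produced by its predecessor, but no gradient ever flows backward across the module boundary.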
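
The Experiment Setup row fully determines the patch geometry: a 64×64 crop with 16×16 patches and 8 pixels of overlap implies a stride of 8 and hence a 7×7 grid, since (64 − 16)/8 + 1 = 7. The sketch below is one way to realize that pipeline with standard torchvision transforms and `Tensor.unfold`; it illustrates the stated setup and is not the authors' code.

```python
import torch
from torchvision import transforms
from PIL import Image

# Augmentation from the setup row: random 64x64 crop of a 96x96 image,
# horizontal flip with probability 0.5, conversion to grayscale.
augment = transforms.Compose([
    transforms.RandomCrop(64),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.Grayscale(),
    transforms.ToTensor(),
])

def to_patches(img: torch.Tensor) -> torch.Tensor:
    # img: (1, 64, 64). Patch size 16 with 8-pixel overlap -> stride 8,
    # giving (64 - 16) / 8 + 1 = 7 patches per side, i.e. a 7x7 grid.
    patches = img.unfold(1, 16, 8).unfold(2, 16, 8)   # (1, 7, 7, 16, 16)
    return patches.permute(1, 2, 0, 3, 4).reshape(49, 1, 16, 16)

img = augment(Image.new("RGB", (96, 96)))  # blank stand-in for an STL-10 image
patches = to_patches(img)                  # (49, 1, 16, 16)
```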