Learning to Branch for Multi-Task Learning

Authors: Pengsheng Guo, Chen-Yu Lee, Daniel Ulbricht

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the proposed method on controlled synthetic data, CelebA, and Taskonomy.
Researcher Affiliation | Industry | Apple. Correspondence to: Pengsheng Guo <pengsheng guo@apple.com>.
Pseudocode | No | The paper describes mathematical formulations but does not include any explicit pseudocode blocks or algorithms.
Open Source Code | No | No statement regarding the release or availability of open-source code for the methodology was found.
Open Datasets | Yes | We use the CelebA dataset (Liu et al., 2015), which contains over 200K face images, each annotated with 40 binary attributes. We extend our method to the recent Taskonomy dataset (Zamir et al., 2018), which contains over 4.5 million indoor images from over 500 buildings.
Dataset Splits | Yes | The CelebA training, validation, and test sets contain 160K, 20K, and 20K images. For Taskonomy, we use the standard tiny split benchmark, which contains 275K training, 54K test, and 52K validation images.
Hardware Specification | Yes | Our method leverages the effectiveness of Gumbel-Softmax so that every child node samples a single discrete action during the forward pass; therefore the network topological space is well maintained and the tree does not grow exponentially with the number of tasks (a minimal sampling sketch follows the table). As a result, it takes 10 hours to search the architecture and 11 hours to obtain the optimal weights for model (a) Learn To Branch-VGG, and 4 hours to search the architecture and 10 hours to obtain the optimal weights for model (b) Learn To Branch Deep-Wide, on a single 16GB Tesla GPU. On a single 32GB Tesla GPU, it takes 2 days to train the topology distribution and 3 days to obtain the final converged network.
Software Dependencies | No | The paper mentions using the 'Adam solver' but does not specify version numbers for any software libraries, frameworks, or programming languages.
Experiment Setup | Yes | The learning rate is set to 10^-3 for the weight matrices and 10^-7 for the branching probabilities throughout training. Temperature is set to 50 and decayed by the square root of the number of iterations. The networks are trained for 500 epochs, with 50 epochs of warmup. We use Adam optimizers with a mini-batch size of 64 to update both the weight matrices and the branching probabilities in our networks. In a second reported setting, temperature is set to 10 and decayed by the number of epochs, and training is warmed up for 2 epochs without updating the branching probabilities to ensure all weight matrices initially receive equal amounts of gradient updates. Weight decay is set to 10^-4 for all experiments. (A hedged training-loop sketch also follows the table.)
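
The hardware row above attributes the bounded search cost to Gumbel-Softmax sampling of a single discrete action per child node. Below is a minimal PyTorch sketch of that idea; the paper releases no code, so the class name `BranchingNode`, the stacked-parent-feature layout, and the use of `torch.nn.functional.gumbel_softmax` with a straight-through (hard) sample are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical branching module: each child node keeps a learnable logit
# vector over its candidate parent nodes and, during the forward pass,
# samples ONE discrete parent via straight-through Gumbel-Softmax.
class BranchingNode(torch.nn.Module):
    def __init__(self, num_parents: int):
        super().__init__()
        # Unnormalized log-probabilities over candidate parents (the
        # "branching probability" parameters referenced in the table).
        self.logits = torch.nn.Parameter(torch.zeros(num_parents))

    def forward(self, parent_features: torch.Tensor, temperature: float) -> torch.Tensor:
        # parent_features: stacked outputs of all candidate parents,
        # shape (num_parents, batch, channels, ...).
        # hard=True yields a one-hot sample in the forward pass while the
        # backward pass uses the soft relaxation (straight-through estimator).
        one_hot = F.gumbel_softmax(self.logits, tau=temperature, hard=True)
        # Weighting the stacked parent outputs by the one-hot sample and
        # summing selects exactly one parent's features.
        return torch.einsum("p,p...->...", one_hot, parent_features)
```

Because each forward pass materializes only one one-hot choice per child node, a single tree is evaluated at a time, which is how the topological space stays bounded instead of growing exponentially with the number of tasks.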
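
The experiment-setup row quotes concrete hyperparameters (two learning rates, Adam, temperature schedules, warmup, weight decay). The skeleton below is one hedged way to wire them together; `model`, `weight_params`, `branch_logits`, and `train_loader` are hypothetical placeholders, only one of the two quoted temperature schedules is shown, and whether weight decay also covers the branching logits is not specified in the excerpt.

```python
import math
import torch

# Hedged training-loop skeleton built only from the hyperparameters quoted
# in the Experiment Setup row; the model interface and data loading
# (assumed to yield mini-batches of size 64) are placeholder assumptions.
def train(model, weight_params, branch_logits, train_loader,
          epochs: int = 500, warmup_epochs: int = 50):
    # 10^-3 for weight matrices, 10^-7 for branching probabilities;
    # weight decay 10^-4 applied here only to the weight matrices, since
    # the excerpt does not say it also covers the branching logits.
    opt_w = torch.optim.Adam(weight_params, lr=1e-3, weight_decay=1e-4)
    opt_b = torch.optim.Adam(branch_logits, lr=1e-7)

    step = 0
    for epoch in range(epochs):
        for batch in train_loader:
            step += 1
            # One plausible reading of "temperature is set to 50 and decayed
            # by the square root of the number of iterations".
            temperature = 50.0 / math.sqrt(step)

            loss = model(batch, temperature=temperature)  # assumed to return a scalar loss
            opt_w.zero_grad()
            opt_b.zero_grad()
            loss.backward()

            opt_w.step()
            # Branching probabilities stay frozen during warmup (50 epochs in
            # one quoted setting, 2 in the other) so that all weight matrices
            # first receive comparable gradient updates.
            if epoch >= warmup_epochs:
                opt_b.step()
```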