Learning to Branch for Multi-Task Learning

Authors: Pengsheng Guo, Chen-Yu Lee, Daniel Ulbricht

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the proposed method on controlled synthetic data, CelebA, and Taskonomy.
Researcher Affiliation | Industry | Apple. Correspondence to: Pengsheng Guo <pengsheng guo@apple.com>.
Pseudocode | No | The paper describes mathematical formulations but does not include any explicit pseudocode blocks or algorithms.
Open Source Code | No | No statement regarding the release or availability of open-source code for the methodology was found.
Open Datasets | Yes | We use the CelebA dataset (Liu et al., 2015), which contains over 200K face images, each annotated with 40 binary attributes. We extend our method to the recent Taskonomy dataset (Zamir et al., 2018), which contains over 4.5 million indoor images from over 500 buildings.
Dataset Splits | Yes | The CelebA training, validation, and test sets contain 160K, 20K, and 20K images. For Taskonomy, we use the standard tiny split benchmark, which contains 275K training, 54K test, and 52K validation images.
Hardware Specification | Yes | Our method leverages the effectiveness of Gumbel-Softmax so that every child node samples a single discrete action during the forward pass; therefore the network topological space is well maintained and the tree does not grow exponentially with the number of tasks (a minimal sampling sketch follows the table). As a result, it takes 10 hours to search the architecture and 11 hours to obtain the optimal weights for model (a) Learn To Branch-VGG, and 4 hours to search the architecture and 10 hours to obtain the optimal weights for model (b) Learn To Branch Deep-Wide, on a single 16GB Tesla GPU. On a single 32GB Tesla GPU, it takes 2 days to train the topology distribution and 3 days to obtain the final converged network.
Software Dependencies | No | The paper mentions using the 'Adam solver' but does not specify version numbers for any software libraries, frameworks, or programming languages.
Experiment Setup | Yes | The learning rate is set to 10^-3 for the weight matrices and 10^-7 for the branching probabilities throughout training. Temperature is set to 50 and decayed by the square root of the number of iterations. The networks are trained for 500 epochs, with 50 epochs of warmup. We use Adam optimizers with a mini-batch size of 64 to update both the weight matrices and the branching probabilities in our networks. In a second reported setting, temperature is set to 10 and decayed by the number of epochs, and training is warmed up for 2 epochs without updating the branching probabilities to ensure all weight matrices initially receive equal amounts of gradient updates. Weight decay is set to 10^-4 for all experiments. (A hedged training-loop sketch also follows the table.)
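
The hardware row above attributes the bounded search cost to Gumbel-Softmax sampling of a single discrete action per child node. Below is a minimal PyTorch sketch of that idea; the paper releases no code, so the class name `BranchingNode`, the stacked-parent-feature layout, and the use of `torch.nn.functional.gumbel_softmax` with a straight-through (hard) sample are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical branching module: each child node keeps a learnable logit
# vector over its candidate parent nodes and, during the forward pass,
# samples ONE discrete parent via straight-through Gumbel-Softmax.
class BranchingNode(torch.nn.Module):
    def __init__(self, num_parents: int):
        super().__init__()
        # Unnormalized log-probabilities over candidate parents (the
        # "branching probability" parameters referenced in the table).
        self.logits = torch.nn.Parameter(torch.zeros(num_parents))

    def forward(self, parent_features: torch.Tensor, temperature: float) -> torch.Tensor:
        # parent_features: stacked outputs of all candidate parents,
        # shape (num_parents, batch, channels, ...).
        # hard=True yields a one-hot sample in the forward pass while the
        # backward pass uses the soft relaxation (straight-through estimator).
        one_hot = F.gumbel_softmax(self.logits, tau=temperature, hard=True)
        # Weighting the stacked parent outputs by the one-hot sample and
        # summing selects exactly one parent's features.
        return torch.einsum("p,p...->...", one_hot, parent_features)
```

Because each forward pass materializes only one one-hot choice per child node, a single tree is evaluated at a time, which is how the topological space stays bounded instead of growing exponentially with the number of tasks.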
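
The experiment-setup row quotes concrete hyperparameters (two learning rates, Adam, temperature schedules, warmup, weight decay). The skeleton below is one hedged way to wire them together; `model`, `weight_params`, `branch_logits`, and `train_loader` are hypothetical placeholders, only one of the two quoted temperature schedules is shown, and whether weight decay also covers the branching logits is not specified in the excerpt.

```python
import math
import torch

# Hedged training-loop skeleton built only from the hyperparameters quoted
# in the Experiment Setup row; the model interface and data loading
# (assumed to yield mini-batches of size 64) are placeholder assumptions.
def train(model, weight_params, branch_logits, train_loader,
          epochs: int = 500, warmup_epochs: int = 50):
    # 10^-3 for weight matrices, 10^-7 for branching probabilities;
    # weight decay 10^-4 applied here only to the weight matrices, since
    # the excerpt does not say it also covers the branching logits.
    opt_w = torch.optim.Adam(weight_params, lr=1e-3, weight_decay=1e-4)
    opt_b = torch.optim.Adam(branch_logits, lr=1e-7)

    step = 0
    for epoch in range(epochs):
        for batch in train_loader:
            step += 1
            # One plausible reading of "temperature is set to 50 and decayed
            # by the square root of the number of iterations".
            temperature = 50.0 / math.sqrt(step)

            loss = model(batch, temperature=temperature)  # assumed to return a scalar loss
            opt_w.zero_grad()
            opt_b.zero_grad()
            loss.backward()

            opt_w.step()
            # Branching probabilities stay frozen during warmup (50 epochs in
            # one quoted setting, 2 in the other) so that all weight matrices
            # first receive comparable gradient updates.
            if epoch >= warmup_epochs:
                opt_b.step()
```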