Efficiently Identifying Task Groupings for Multi-Task Learning

Authors: Chris Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, Chelsea Finn

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "On the large-scale Taskonomy computer vision dataset, we find this method can decrease test loss by 10.0% compared to simply training all tasks together while operating 11.6 times faster than a state-of-the-art task grouping method. Our empirical findings are summarized in Figure 3. For our task grouping evaluation, we compare two classes of approaches: approaches that determine task groupings, and approaches that train on all tasks together but alter the optimization. We evaluate the capacity of TAG to select task groupings on CelebA, a large-scale face attributes dataset [38], and Taskonomy, a massive computer vision dataset of indoor scenes [54]. Following this analysis, we direct our focus towards answering the following questions with ablation experiments on CelebA." (A minimal sketch of the inter-task affinity computation behind these numbers appears below the table.)
Researcher Affiliation | Collaboration | Christopher Fifty¹, Ehsan Amid¹, Zhe Zhao¹, Tianhe Yu¹,², Rohan Anil¹, Chelsea Finn¹,² (¹Google Research, Brain Team; ²Stanford University); contact: cfifty@google.com
Pseudocode | No | The paper does not contain any sections explicitly labeled "Pseudocode" or "Algorithm", nor does it present structured, code-like steps for a procedure.
Open Source Code | Yes | "Our code is available at github.com/google-research/google-research/tree/master/tag."
Open Datasets | Yes | "We evaluate the capacity of TAG to select task groupings on CelebA, a large-scale face attributes dataset [38], and Taskonomy, a massive computer vision dataset of indoor scenes [54]." [38] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. 2015 IEEE International Conference on Computer Vision (ICCV), Dec 2015. [54] Amir R. Zamir, Alexander Sax, William Shen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2018.
Dataset Splits | Yes | "We select a subset of 9 attributes {a1, a2, a3, a4, a5, a6, a7, a8, a9} from the 40 possible attributes in CelebA and optimize the baseline MTL model by tuning architecture, batch size, and learning rate to maximize the performance of training all tasks together on the validation set. Is measuring the change in train loss comparable with the change in validation loss when computing inter-task affinity? To our surprise, the inter-task affinity scores computed on the validation set are very similar to the inter-task affinity scores computed on the training set (Pearson's coefficient: 0.9804)." (A sketch of this train-vs-validation agreement check appears below the table.)
Hardware Specification | Yes | "All models were run on a Tesla V100 instance with the time to train the full MTL model being approximately 83 minutes in CelebA and 146 hours in Taskonomy. To put this cost into perspective, on an 8-GPU, on-demand p3.16xlarge AWS instance, the difference in monetary expenditure between TAG and HOA would be $6,144.48." (A back-of-envelope check of this figure appears below the table.)
Software Dependencies | No | The paper mentions software like TensorFlow [1] and Keras [12] in its references, but it does not provide specific version numbers for any software dependencies required to reproduce the experiments in the main text.
Experiment Setup | Yes | "We select a subset of 9 attributes {a1, a2, a3, a4, a5, a6, a7, a8, a9} from the 40 possible attributes in CelebA and optimize the baseline MTL model by tuning architecture, batch size, and learning rate to maximize the performance of training all tasks together on the validation set. We do not tune other methods with the exception of GradNorm, for which we search over {0.1, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0} for alpha. For task grouping algorithms, we evaluate the set of {2-splits, 3-splits, 4-splits} inference-time memory budgets." (A sketch of the network-selection step behind these budgets appears below the table.)
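
The Research Type row quotes TAG's headline numbers; the quantity behind them is the inter-task affinity score. The paper applies a one-step "lookahead" update of the shared parameters using task i's gradient and records the relative change that step induces in every other task j's loss. Below is a minimal sketch of that computation, assuming a shared-trunk/per-head model; the released implementation is TensorFlow, while this sketch uses PyTorch, and the `heads`/`losses` interface is a hypothetical simplification.

```python
# Minimal sketch of TAG-style inter-task affinity:
#   Z[i][j] = 1 - L_j(theta after task i's lookahead step) / L_j(theta)
import copy
import torch

def inter_task_affinity(shared, heads, losses, batch, lr=1e-2):
    """shared: shared trunk nn.Module; heads: dict task -> head nn.Module;
    losses: dict task -> fn(shared, head, batch) -> scalar loss tensor."""
    with torch.no_grad():  # baseline loss of every task at theta_t
        base = {t: losses[t](shared, heads[t], batch).item() for t in heads}
    affinity = {}
    for i in heads:
        lookahead = copy.deepcopy(shared)  # fresh copy of theta_t for task i
        grads = torch.autograd.grad(
            losses[i](lookahead, heads[i], batch),
            list(lookahead.parameters()))
        with torch.no_grad():  # one SGD step on the shared parameters only
            for p, g in zip(lookahead.parameters(), grads):
                p -= lr * g
            affinity[i] = {  # relative loss change induced on every other task
                j: 1.0 - losses[j](lookahead, heads[j], batch).item() / base[j]
                for j in heads if j != i}
    return affinity
```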
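The Dataset Splits row cites a Pearson coefficient of 0.9804 between affinities computed on training versus validation data. A sketch of that agreement check, reusing the hypothetical `inter_task_affinity` helper above:

```python
# Correlate affinity scores computed on a training batch with those
# computed on a validation batch, flattening both score matrices.
from scipy.stats import pearsonr

def affinity_agreement(shared, heads, losses, train_batch, val_batch):
    z_train = inter_task_affinity(shared, heads, losses, train_batch)
    z_val = inter_task_affinity(shared, heads, losses, val_batch)
    pairs = [(i, j) for i in heads for j in heads if i != j]
    a = [z_train[i][j] for i, j in pairs]
    b = [z_val[i][j] for i, j in pairs]
    r, _ = pearsonr(a, b)  # the paper reports r = 0.9804 on CelebA
    return r
```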
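The $6,144.48 gap quoted in the Hardware Specification row is easy to sanity-check. Assuming the standard us-east-1 on-demand rate for p3.16xlarge at the time, roughly $24.48/hour (an assumption; the paper does not state the rate), the figure corresponds to about 251 instance-hours saved:

```python
# Back-of-envelope check of the quoted TAG-vs-HOA cost gap,
# assuming a $24.48/hour on-demand p3.16xlarge rate (not stated in the paper).
rate_per_hour = 24.48
cost_gap = 6144.48
hours_saved = cost_gap / rate_per_hour  # = 251.0 instance-hours
print(f"TAG vs. HOA: ~{hours_saved:.0f} fewer p3.16xlarge hours")
```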
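Finally, the {2-splits, 3-splits, 4-splits} budgets in the Experiment Setup row bound how many networks may be deployed at inference time. Once each candidate task group has a predicted per-task score, selecting the best groups is a coverage problem; the paper solves it with branch-and-bound, but for small budgets an exhaustive sketch makes the objective concrete. The `group_scores` structure here is a hypothetical input, not the paper's API:

```python
# Pick k candidate groups that cover all tasks and maximize total per-task
# performance, where each task is scored by the best group containing it.
from itertools import combinations

def best_k_splits(tasks, group_scores, k):
    """group_scores: dict frozenset(tasks) -> dict task -> predicted score."""
    best, best_total = None, float("-inf")
    for choice in combinations(group_scores, k):
        covered = set().union(*choice)
        if covered != set(tasks):
            continue  # every task must be served by some deployed network
        total = sum(max(group_scores[g][t] for g in choice if t in g)
                    for t in tasks)
        if total > best_total:
            best, best_total = choice, total
    return best, best_total
```

For the budgets above, one would call `best_k_splits(tasks, group_scores, k)` for k in {2, 3, 4} and report the best grouping found at each budget.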