In Defense of the Unitary Scalarization for Deep Multi-Task Learning

Authors: Vitaly Kurin, Alessandro De Palma, Ilya Kostrikov, Shimon Whiteson, Pawan K Mudigonda

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | A comprehensive experimental evaluation (Section 4) of recent SMTOs on popular multi-task benchmarks shows that no SMTO consistently outperforms unitary scalarization, in spite of the added complexity and overhead. In particular, either the differences between unitary scalarization and SMTOs are not statistically significant, or they can be bridged by standard regularization and stabilization techniques from the single-task literature. The paper's reinforcement learning (RL) experiments include optimizers previously applied only to supervised learning. (A minimal sketch of unitary scalarization is given after this table.)
Researcher Affiliation | Academia | Vitaly Kurin (University of Oxford, vitaly.kurin@cs.ox.ac.uk); Alessandro De Palma (University of Oxford, adepalma@robots.ox.ac.uk); Ilya Kostrikov (University of California, Berkeley; New York University); Shimon Whiteson (University of Oxford); M. Pawan Kumar (University of Oxford).
Pseudocode | No | No pseudocode or algorithm blocks found.
Open Source Code | Yes | Code to reproduce the experiments, including a unified PyTorch [50] implementation of the considered SMTOs, is available at https://github.com/yobibyte/unitary-scalarization-dmtl.
Open Datasets | Yes | We present results on the Multi-MNIST [54] dataset, a simple two-task supervised learning benchmark. We now show results for the CelebA [44] dataset, a challenging 40-task multi-label classification problem. In order to complement the multi-task classification experiments for Multi-MNIST and CelebA, we present results for Cityscapes [13], a dataset for semantic understanding of urban street scenes. For RL experiments, we use Meta-World [65] and the Soft Actor-Critic [20] implementation from [55].
Dataset Splits | Yes | Surprisingly, several MTL works [11, 40, 42, 66] report validation results, making it easier to overfit. Instead, following standard machine learning practice, we select a model on the validation set, and later report test metrics for all benchmarks. Validation results are also available in Appendix D. Appendix C.1 reports dataset descriptions, the computational setup, hyperparameter and tuning details. (See the model-selection sketch after this table.)
Hardware Specification | No | The paper states: 'We used a single NVIDIA GeForce RTX 3090 GPU for all supervised learning experiments. For RL experiments, we used a shared cluster with NVIDIA Tesla V100 GPUs.' This provides GPU models, but lacks details like CPU, memory, or specific count of GPUs, which are necessary for full reproducibility.
Software Dependencies | No | The paper mentions 'PyTorch [50]' but does not specify a version number for PyTorch or any other software dependencies.
Experiment Setup | Yes | Appendix C.1 reports dataset descriptions, the computational setup, hyperparameter and tuning details. We tuned ℓ2 regularization terms λ for all SMTOs in the following grid: λ ∈ {0, 10⁻⁴, 10⁻³}. The best validation performance was attained with λ = 10⁻³ for unitary scalarization, IMTL and PCGrad, and with λ = 10⁻⁴ for MGDA, GradDrop, and RLW. Validation performance was further stabilized by the addition of several dropout layers (see Figure 5), with dropout probabilities from 0.25 to 0.5. (See the tuning-grid sketch after this table.)
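
Unitary scalarization, referenced in the Research Type row, simply minimizes the unweighted sum of per-task losses with a single backward pass. The following is a minimal PyTorch sketch of that idea; the toy shared trunk, task heads, tensor shapes, and learning rate are illustrative assumptions, not the authors' architecture or code.

```python
# Minimal sketch (not the authors' code): unitary scalarization sums the
# per-task losses with unit weights and backpropagates through the sum.
import torch
import torch.nn as nn

# Hypothetical shared trunk with two task-specific classification heads.
trunk = nn.Linear(32, 64)
heads = nn.ModuleList([nn.Linear(64, 10), nn.Linear(64, 10)])
params = list(trunk.parameters()) + list(heads.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)  # assumed learning rate

x = torch.randn(8, 32)                                   # dummy batch
targets = [torch.randint(0, 10, (8,)) for _ in heads]    # dummy labels per task

features = torch.relu(trunk(x))
losses = [nn.functional.cross_entropy(head(features), t)
          for head, t in zip(heads, targets)]
loss = sum(losses)          # unit weights: no per-task reweighting or gradient surgery
optimizer.zero_grad()
loss.backward()             # a single backward pass, unlike most SMTOs
optimizer.step()
```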
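The Dataset Splits row describes selecting a model on the validation set and then reporting its test metrics. Below is a hedged sketch of that selection step only; the run records and metric names are hypothetical.

```python
# Sketch (assumed, not from the paper's code) of validation-based model selection:
# choose the run with the best validation metric, report that run's test metric.
runs = [
    {"name": "run_a", "val_accuracy": 0.91, "test_accuracy": 0.89},  # hypothetical records
    {"name": "run_b", "val_accuracy": 0.93, "test_accuracy": 0.90},
]
best = max(runs, key=lambda r: r["val_accuracy"])
print(f"selected {best['name']}; reported test accuracy = {best['test_accuracy']}")
```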
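The Experiment Setup row lists the ℓ2 grid λ ∈ {0, 10⁻⁴, 10⁻³} and dropout probabilities between 0.25 and 0.5. The sketch below wires those reported values into a plain grid search; the placeholder model, learning rate, and the use of Adam's weight_decay argument to realize the ℓ2 penalty are assumptions, not the authors' setup.

```python
# Hedged sketch of the reported tuning setup; only the grids come from the paper.
import itertools
import torch
import torch.nn as nn

L2_GRID = [0.0, 1e-4, 1e-3]    # λ values tuned for all SMTOs (from the paper)
DROPOUT_GRID = [0.25, 0.5]     # dropout probabilities used for stabilization (from the paper)

def build_model(p_drop: float) -> nn.Module:
    # Placeholder architecture; the paper inserts several dropout layers (Figure 5).
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(p_drop), nn.Linear(64, 10))

for weight_decay, p_drop in itertools.product(L2_GRID, DROPOUT_GRID):
    model = build_model(p_drop)
    # weight_decay applies an ℓ2 penalty inside the optimizer update (assumed mechanism).
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=weight_decay)
    # ... train, evaluate on the validation set, and keep the best configuration.
```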