In Defense of the Unitary Scalarization for Deep Multi-Task Learning
Authors: Vitaly Kurin, Alessandro De Palma, Ilya Kostrikov, Shimon Whiteson, M. Pawan Kumar
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | A comprehensive experimental evaluation (Section 4) of recent SMTOs on popular multi-task benchmarks, showing that no SMTO consistently outperforms unitary scalarization in spite of the added complexity and overhead. In particular, either the differences between unitary scalarization and SMTOs are not statistically significant, or they can be bridged by standard regularization and stabilization techniques from the single-task literature. Our reinforcement learning (RL) experiments include optimizers previously applied only to supervised learning. (Minimal sketches of unitary scalarization and of one SMTO appear below the table.) |
| Researcher Affiliation | Academia | Vitaly Kurin (University of Oxford, vitaly.kurin@cs.ox.ac.uk); Alessandro De Palma (University of Oxford, adepalma@robots.ox.ac.uk); Ilya Kostrikov (University of California, Berkeley / New York University); Shimon Whiteson (University of Oxford); M. Pawan Kumar (University of Oxford) |
| Pseudocode | No | No pseudocode or algorithm blocks found. |
| Open Source Code | Yes | Code to reproduce the experiments, including a unified PyTorch [50] implementation of the considered SMTOs, is available at https://github.com/yobibyte/unitary-scalarization-dmtl. |
| Open Datasets | Yes | We present results on the Multi-MNIST [54] dataset, a simple two-task supervised learning benchmark. We now show results for the CelebA [44] dataset, a challenging 40-task multi-label classification problem. In order to complement the multi-task classification experiments for Multi-MNIST and CelebA, we present results for Cityscapes [13], a dataset for semantic understanding of urban street scenes. For RL experiments, we use Meta-World [65] and the Soft Actor-Critic [20] implementation from [55]. |
| Dataset Splits | Yes | Surprisingly, several MTL works [11, 40, 42, 66] report validation results, making it easier to overfit. Instead, following standard machine learning practice, we select a model on the validation set, and later report test metrics for all benchmarks. Validation results are also available in appendix D. Appendix C.1 reports dataset descriptions, the computational setup, hyperparameter and tuning details. |
| Hardware Specification | No | The paper states: 'We used a single NVIDIA GeForce RTX 3090 GPU for all supervised learning experiments. For RL experiments, we used a shared cluster with NVIDIA Tesla V100 GPUs.' This provides GPU models, but lacks details like CPU, memory, or specific count of GPUs, which are necessary for full reproducibility. |
| Software Dependencies | No | The paper mentions 'PyTorch [50]' but does not specify a version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | Appendix C.1 reports dataset descriptions, the computational setup, hyperparameter and tuning details. We tuned ℓ2 regularization terms λ for all SMTOs in the grid λ ∈ {0, 10⁻⁴, 10⁻³}. The best validation performance was attained with λ = 10⁻³ for unitary scalarization, IMTL and PCGrad, and with λ = 10⁻⁴ for MGDA, GradDrop, and RLW. Validation performance was further stabilized by the addition of several dropout layers (see Figure 5), with dropout probabilities from 0.25 to 0.5. (A sketch of this regularization setup appears below the table.) |
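
For readers unfamiliar with the baseline the paper defends, the following is a minimal PyTorch sketch of unitary scalarization: all per-task losses are summed with unit weights and optimized with a single backward pass. The model interface, batch layout, and loss functions are hypothetical placeholders, not the authors' repository code.

```python
import torch

# Minimal sketch of unitary scalarization: sum the per-task losses with
# unit weights and take a single gradient step on the shared parameters.
# `model` is assumed to return a dict of per-task predictions; all names
# here are illustrative placeholders.
def unitary_scalarization_step(model, optimizer, batch, task_losses):
    optimizer.zero_grad()
    outputs = model(batch["inputs"])  # dict: task name -> prediction
    total_loss = sum(
        loss_fn(outputs[task], batch["targets"][task])
        for task, loss_fn in task_losses.items()
    )
    total_loss.backward()  # one backward pass over the summed loss
    optimizer.step()
    return total_loss.item()
```

The appeal of this baseline, as the paper argues, is exactly this simplicity: one forward pass, one backward pass, and no per-task gradient bookkeeping.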
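The SMTOs the paper compares against instead manipulate per-task gradients before combining them. As one illustrative example, here is a simplified flat-vector sketch of PCGrad-style gradient surgery (Yu et al., 2020), one of the SMTOs evaluated in the paper; the unified implementation in the linked repository differs in structure.

```python
import random
import torch

# Simplified sketch of PCGrad-style gradient surgery on flat gradient
# vectors (one per task). When two task gradients conflict (negative dot
# product), the conflicting component is projected out before summing.
# This is an illustration of the technique, not the repository's code.
def pcgrad_combine(task_grads):
    """task_grads: list of 1-D gradient tensors, one per task."""
    projected = []
    for g_i in task_grads:
        g = g_i.clone()
        others = [g_j for g_j in task_grads if g_j is not g_i]
        random.shuffle(others)  # PCGrad projects against tasks in random order
        for g_j in others:
            dot = torch.dot(g, g_j)
            if dot < 0:  # conflict: remove the component along g_j
                g -= dot / g_j.norm() ** 2 * g_j
        projected.append(g)
    return torch.stack(projected).sum(dim=0)  # combined update direction
```

The combined vector would then be written back into the parameters' `.grad` buffers before calling `optimizer.step()`; note the extra per-task backward passes and pairwise projections that such methods add over plain summation.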
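Finally, a sketch of the regularization and stabilization choices reported in the Experiment Setup row: an ℓ2 penalty selected from {0, 10⁻⁴, 10⁻³} (implemented below via the optimizer's weight decay) and dropout layers with probabilities between 0.25 and 0.5. The architecture and dimensions are illustrative only, not the paper's networks.

```python
import torch
import torch.nn as nn

# Illustrative multi-task network with the regularizers described above:
# dropout in the shared trunk (p in [0.25, 0.5]) and an l2 term applied
# through the optimizer's weight_decay argument. Sizes are placeholders.
class MultiTaskNet(nn.Module):
    def __init__(self, in_dim, num_tasks, p_drop=0.25):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p_drop),
        )
        self.heads = nn.ModuleList(nn.Linear(256, 1) for _ in range(num_tasks))

    def forward(self, x):
        z = self.trunk(x)
        return [head(z) for head in self.heads]  # one output per task

model = MultiTaskNet(in_dim=128, num_tasks=40, p_drop=0.5)
# weight_decay implements the l2 regularization term lambda from the grid
optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-3)
```

The paper's point is that these are standard single-task tools; tuning them is what closes much of the reported gap between unitary scalarization and the SMTOs.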