In Defense of the Unitary Scalarization for Deep Multi-Task Learning

Authors: Vitaly Kurin, Alessandro De Palma, Ilya Kostrikov, Shimon Whiteson, Pawan K Mudigonda

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | A comprehensive experimental evaluation (Section 4) of recent SMTOs on popular multi-task benchmarks shows that no SMTO consistently outperforms unitary scalarization, in spite of the added complexity and overhead. In particular, either the differences between unitary scalarization and SMTOs are not statistically significant, or they can be bridged by standard regularization and stabilization techniques from the single-task literature. The paper's reinforcement learning (RL) experiments include optimizers previously applied only to supervised learning. (A minimal sketch of unitary scalarization is given after this table.)
Researcher Affiliation | Academia | Vitaly Kurin (University of Oxford, vitaly.kurin@cs.ox.ac.uk); Alessandro De Palma (University of Oxford, adepalma@robots.ox.ac.uk); Ilya Kostrikov (University of California, Berkeley; New York University); Shimon Whiteson (University of Oxford); M. Pawan Kumar (University of Oxford).
Pseudocode | No | No pseudocode or algorithm blocks found.
Open Source Code | Yes | Code to reproduce the experiments, including a unified PyTorch [50] implementation of the considered SMTOs, is available at https://github.com/yobibyte/unitary-scalarization-dmtl.
Open Datasets | Yes | We present results on the Multi-MNIST [54] dataset, a simple two-task supervised learning benchmark. We now show results for the CelebA [44] dataset, a challenging 40-task multi-label classification problem. In order to complement the multi-task classification experiments for Multi-MNIST and CelebA, we present results for Cityscapes [13], a dataset for semantic understanding of urban street scenes. For RL experiments, we use Meta-World [65] and the Soft Actor-Critic [20] implementation from [55].
Dataset Splits | Yes | Surprisingly, several MTL works [11, 40, 42, 66] report validation results, making it easier to overfit. Instead, following standard machine learning practice, we select a model on the validation set, and later report test metrics for all benchmarks. Validation results are also available in Appendix D. Appendix C.1 reports dataset descriptions, the computational setup, hyperparameter and tuning details. (See the model-selection sketch after this table.)
Hardware Specification | No | The paper states: 'We used a single NVIDIA GeForce RTX 3090 GPU for all supervised learning experiments. For RL experiments, we used a shared cluster with NVIDIA Tesla V100 GPUs.' This provides GPU models, but lacks details like CPU, memory, or specific count of GPUs, which are necessary for full reproducibility.
Software Dependencies | No | The paper mentions 'PyTorch [50]' but does not specify a version number for PyTorch or any other software dependencies.
Experiment Setup | Yes | Appendix C.1 reports dataset descriptions, the computational setup, hyperparameter and tuning details. We tuned ℓ2 regularization terms λ for all SMTOs in the following grid: λ ∈ {0, 10⁻⁴, 10⁻³}. The best validation performance was attained with λ = 10⁻³ for unitary scalarization, IMTL and PCGrad, and with λ = 10⁻⁴ for MGDA, GradDrop, and RLW. Validation performance was further stabilized by the addition of several dropout layers (see Figure 5), with dropout probabilities from 0.25 to 0.5. (See the tuning-grid sketch after this table.)
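
Unitary scalarization, referenced in the Research Type row, simply minimizes the unweighted sum of per-task losses with a single backward pass. The following is a minimal PyTorch sketch of that idea; the toy shared trunk, task heads, tensor shapes, and learning rate are illustrative assumptions, not the authors' architecture or code.

```python
# Minimal sketch (not the authors' code): unitary scalarization sums the
# per-task losses with unit weights and backpropagates through the sum.
import torch
import torch.nn as nn

# Hypothetical shared trunk with two task-specific classification heads.
trunk = nn.Linear(32, 64)
heads = nn.ModuleList([nn.Linear(64, 10), nn.Linear(64, 10)])
params = list(trunk.parameters()) + list(heads.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)  # assumed learning rate

x = torch.randn(8, 32)                                   # dummy batch
targets = [torch.randint(0, 10, (8,)) for _ in heads]    # dummy labels per task

features = torch.relu(trunk(x))
losses = [nn.functional.cross_entropy(head(features), t)
          for head, t in zip(heads, targets)]
loss = sum(losses)          # unit weights: no per-task reweighting or gradient surgery
optimizer.zero_grad()
loss.backward()             # a single backward pass, unlike most SMTOs
optimizer.step()
```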
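The Dataset Splits row describes selecting a model on the validation set and then reporting its test metrics. Below is a hedged sketch of that selection step only; the run records and metric names are hypothetical.

```python
# Sketch (assumed, not from the paper's code) of validation-based model selection:
# choose the run with the best validation metric, report that run's test metric.
runs = [
    {"name": "run_a", "val_accuracy": 0.91, "test_accuracy": 0.89},  # hypothetical records
    {"name": "run_b", "val_accuracy": 0.93, "test_accuracy": 0.90},
]
best = max(runs, key=lambda r: r["val_accuracy"])
print(f"selected {best['name']}; reported test accuracy = {best['test_accuracy']}")
```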
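The Experiment Setup row lists the ℓ2 grid λ ∈ {0, 10⁻⁴, 10⁻³} and dropout probabilities between 0.25 and 0.5. The sketch below wires those reported values into a plain grid search; the placeholder model, learning rate, and the use of Adam's weight_decay argument to realize the ℓ2 penalty are assumptions, not the authors' setup.

```python
# Hedged sketch of the reported tuning setup; only the grids come from the paper.
import itertools
import torch
import torch.nn as nn

L2_GRID = [0.0, 1e-4, 1e-3]    # λ values tuned for all SMTOs (from the paper)
DROPOUT_GRID = [0.25, 0.5]     # dropout probabilities used for stabilization (from the paper)

def build_model(p_drop: float) -> nn.Module:
    # Placeholder architecture; the paper inserts several dropout layers (Figure 5).
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(p_drop), nn.Linear(64, 10))

for weight_decay, p_drop in itertools.product(L2_GRID, DROPOUT_GRID):
    model = build_model(p_drop)
    # weight_decay applies an ℓ2 penalty inside the optimizer update (assumed mechanism).
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=weight_decay)
    # ... train, evaluate on the validation set, and keep the best configuration.
```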