Revisiting Scalarization in Multi-Task Learning: A Theoretical Perspective

Authors: Yuzheng Hu, Ruicheng Xian, Qilong Wu, Qiuling Fan, Lang Yin, Han Zhao

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We additionally perform experiments on a real-world dataset using both scalarization and state-of-the-art SMTOs. The experimental results not only corroborate our theoretical findings, but also unveil the potential of SMTOs in finding balanced solutions, which cannot be achieved by scalarization.
Researcher Affiliation | Academia | Yuzheng Hu¹, Ruicheng Xian¹, Qilong Wu¹, Qiuling Fan², Lang Yin¹, Han Zhao¹ (¹Department of Computer Science, ²Department of Mathematics, University of Illinois Urbana-Champaign)
Pseudocode | Yes | Algorithm 1: An O(k²) algorithm of checking C1
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper.
Open Datasets | Yes | We use the SARCOS dataset for our experiment (Vijayakumar and Schaal, 2000), where the problem is to predict the torque of seven robot arms given inputs that consist of the position, velocity, and acceleration of the respective arms.
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | Our code is based on the released implementation of MGDA-UB, which also includes the code for MGDA. We apply their code on the SARCOS dataset. For both methods, we use vanilla gradient descent with a learning rate of 0.5 for 100 epochs, following the default choice in the released implementation.
Experiment Setup | Yes | Our regression model is a two-layer linear network with hidden size q = 1 (no bias). To explore the portion of the Pareto front achievable by linear scalarization, we fit 100,000 linear regressors with randomly sampled convex coefficients and record their performance... For both methods, we use vanilla gradient descent with a learning rate of 0.5 for 100 epochs, following the default choice in the released implementation. We comment that early stopping can also be adopted, i.e., terminate once the minimum norm of the convex hull of gradients is smaller than a threshold, which we set to 10⁻³. (Minimal sketches of this training setup and the stopping criterion follow the table.)
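
The Experiment Setup row describes a scalarization sweep: a two-layer linear network with hidden size q = 1 and no bias, trained under randomly sampled convex task weights with vanilla gradient descent (learning rate 0.5, 100 epochs). The sketch below is a minimal, hypothetical rendering of that setup in PyTorch, not the authors' released code: the data tensors are random placeholders standing in for SARCOS (21 input features, 7 torque targets), and only a handful of weight draws are shown where the paper records 100,000.

import torch

d_in, n_tasks, n_samples = 21, 7, 1000           # SARCOS-like shapes: 21 inputs, 7 torque targets
X = torch.randn(n_samples, d_in)                 # placeholder for the SARCOS inputs
Y = torch.randn(n_samples, n_tasks)              # placeholder for the 7 torque targets

def train_scalarized(weights, epochs=100, lr=0.5):
    """Minimize a convex combination of the per-task MSE losses."""
    model = torch.nn.Sequential(
        torch.nn.Linear(d_in, 1, bias=False),    # shared bottleneck, hidden size q = 1
        torch.nn.Linear(1, n_tasks, bias=False), # one linear head per task
    )
    opt = torch.optim.SGD(model.parameters(), lr=lr)   # full-batch vanilla gradient descent
    for _ in range(epochs):
        opt.zero_grad()
        per_task = ((model(X) - Y) ** 2).mean(dim=0)   # vector of the 7 task losses
        (weights * per_task).sum().backward()          # scalarized objective
        opt.step()
    with torch.no_grad():
        return ((model(X) - Y) ** 2).mean(dim=0)       # final per-task losses

# Randomly sampled convex coefficients (Dirichlet draws lie on the simplex);
# each run yields one point of the loss profile achievable by scalarization.
dirichlet = torch.distributions.Dirichlet(torch.ones(n_tasks))
profiles = torch.stack([train_scalarized(dirichlet.sample()) for _ in range(10)])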
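The same row mentions an early-stopping rule for the SMTO runs: terminate once the minimum norm over the convex hull of the per-task gradients falls below 10⁻³. Below is a hedged NumPy sketch of that check using a generic Frank-Wolfe iteration for the min-norm point; the released MGDA-UB implementation has its own solver, and the helper name and stand-in gradients here are hypothetical.

import numpy as np

def min_norm_in_convex_hull(grads, iters=100):
    """Approximate min over the simplex of || alpha @ grads || via Frank-Wolfe.

    grads: array of shape (k, p), one flattened gradient per task.
    """
    k = grads.shape[0]
    alpha = np.full(k, 1.0 / k)                  # start from the uniform combination
    for _ in range(iters):
        d = alpha @ grads                        # current point in the convex hull
        i = int(np.argmin(grads @ d))            # task gradient with smallest inner product with d
        diff = d - grads[i]
        denom = float(diff @ diff)
        if denom == 0.0:                         # d already coincides with that vertex
            break
        gamma = float(np.clip((d @ diff) / denom, 0.0, 1.0))  # exact line search on the segment
        alpha = (1.0 - gamma) * alpha
        alpha[i] += gamma
    return float(np.linalg.norm(alpha @ grads))

# Stand-in gradients for 7 tasks; stop training once the min norm drops below the 1e-3 threshold.
grads = np.random.randn(7, 50)
should_stop = min_norm_in_convex_hull(grads) < 1e-3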