Revisiting Scalarization in Multi-Task Learning: A Theoretical Perspective
Authors: Yuzheng Hu, Ruicheng Xian, Qilong Wu, Qiuling Fan, Lang Yin, Han Zhao
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We additionally perform experiments on a real-world dataset using both scalarization and state-of-the-art SMTOs. The experimental results not only corroborate our theoretical findings, but also unveil the potential of SMTOs in finding balanced solutions, which cannot be achieved by scalarization. |
| Researcher Affiliation | Academia | Yuzheng Hu1 Ruicheng Xian1 Qilong Wu1 Qiuling Fan2 Lang Yin1 Han Zhao1 1Department of Computer Science 2Department of Mathematics University of Illinois Urbana-Champaign |
| Pseudocode | Yes | Algorithm 1: An O(k²) algorithm of checking C1 |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper. |
| Open Datasets | Yes | We use the SARCOS dataset for our experiment (Vijayakumar and Schaal, 2000), where the problem is to predict the torque of seven robot arms given inputs that consist of the position, velocity, and acceleration of the respective arms. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | Our code is based on the released implementation of MGDA-UB, which also includes the code for MGDA. We apply their code on the SARCOS dataset. For both methods, we use vanilla gradient descent with a learning rate of 0.5 for 100 epochs, following the default choice in the released implementation. |
| Experiment Setup | Yes | Our regression model is a two-layer linear network with hidden size q = 1 (no bias). To explore the portion of the Pareto front achievable by linear scalarization, we fit 100,000 linear regressors with randomly sampled convex coefficients and record their performance... For both methods, we use vanilla gradient descent with a learning rate of 0.5 for 100 epochs, following the default choice in the released implementation. We comment that early stopping can also be adopted, i.e., terminate once the minimum norm of the convex hull of gradients is smaller than a threshold, for which we set to be 10⁻³. |
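
As a concrete reading of the experiment-setup row above, the sketch below mirrors the structure of the scalarization sweep: a two-layer linear network with hidden size q = 1 and no bias, trained by vanilla gradient descent (learning rate 0.5, 100 epochs) on a convex combination of per-task squared losses, repeated over randomly sampled convex coefficients. The data here is a random stand-in with SARCOS-like shapes (21 inputs, 7 torque targets), and the function names, Dirichlet sampler, and reduced number of sweeps are illustrative assumptions rather than the authors' released code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in with SARCOS-like shapes (21 inputs, 7 torque targets);
# swap in the real, normalized dataset for an actual reproduction run.
n, d, k, q = 1000, 21, 7, 1
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, k))

def train_scalarized(X, Y, weights, lr=0.5, epochs=100):
    """Vanilla GD on a two-layer linear network (hidden size q = 1, no bias),
    minimizing the convex combination sum_j weights[j] * MSE_j."""
    W1 = 0.01 * rng.standard_normal((X.shape[1], q))   # input -> hidden
    W2 = 0.01 * rng.standard_normal((q, Y.shape[1]))   # hidden -> tasks
    for _ in range(epochs):
        H = X @ W1                      # n x q hidden representation
        err = H @ W2 - Y                # n x k residuals
        weighted = err * weights        # broadcast the convex coefficients
        grad_W2 = 2.0 / len(X) * H.T @ weighted
        grad_W1 = 2.0 / len(X) * X.T @ (weighted @ W2.T)
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return ((X @ W1 @ W2 - Y) ** 2).mean(axis=0)       # per-task MSE

# Sweep random convex coefficients (Dirichlet(1) is uniform on the simplex)
# and record the per-task losses; the paper uses 100,000 draws.
records = np.array([train_scalarized(X, Y, rng.dirichlet(np.ones(k)))
                    for _ in range(100)])
```

The recorded per-task losses trace the portion of the loss profile reachable by linear scalarization, which is what the paper compares against the solutions returned by the SMTO baselines.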
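
The early-stopping rule quoted above (terminate once the minimum norm over the convex hull of the task gradients drops below 10⁻³) can be checked with a small Frank-Wolfe solve on the Gram matrix of per-task gradients, in the spirit of MGDA's min-norm subproblem. This is a hedged sketch: the function name and iteration budget are assumptions, and the released MGDA-UB implementation may solve the subproblem differently.

```python
import numpy as np

def min_norm_in_convex_hull(grads, n_iter=250):
    """Approximate the minimum-norm point in the convex hull of the rows of
    `grads` (k tasks x d parameters) via Frank-Wolfe over the simplex."""
    k = grads.shape[0]
    gram = grads @ grads.T              # k x k Gram matrix of task gradients
    w = np.full(k, 1.0 / k)             # start at the uniform combination
    for _ in range(n_iter):
        # objective f(w) = w^T gram w has gradient 2 * gram @ w, so the best
        # simplex vertex is the coordinate with the smallest entry of gram @ w
        i = int(np.argmin(gram @ w))
        d = -w
        d[i] += 1.0                     # direction toward vertex e_i
        denom = d @ gram @ d
        if denom <= 1e-12:
            break
        # exact line search for the quadratic objective, clipped to [0, 1]
        gamma = float(np.clip(-(w @ gram @ d) / denom, 0.0, 1.0))
        w = w + gamma * d
    return w, float(np.sqrt(max(w @ gram @ w, 0.0)))

# Example stopping check (task_grads holds one flattened gradient per task):
# _, min_norm = min_norm_in_convex_hull(task_grads)
# stop_training = min_norm < 1e-3
```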