Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks

Authors: Liam Collins, Hamed Hassani, Mahdi Soltanolkotabi, Aryan Mokhtari, Sanjay Shakkottai

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "Figure 1 shows that gradient-based multi-task learning with a two-layer ReLU NN with first-layer parameters (the representation) shared among all tasks and last-layer weights (the head) learned uniquely for each task indeed recovers the ground-truth subspace with error diminishing with the number of tasks. We theoretically justify this observation, providing the first known proofs of multi-task feature learning with a nonlinear model along with a new explanation for why multi-tasking aids feature learning. Our theoretical contributions are summarized below, and verified numerically in Appendix F." From Appendix F (Numerical Simulations): "In this section we verify our analysis with numerical simulations. We aim both to confirm that the alternating stochastic gradient descent algorithm for multi-task pretraining that we study recovers the ground-truth representation and to further explore the mechanisms by which it does so. To this end, all experiments are conducted on synthetic data generated according to the model described in Section 2."
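The quoted setup is a two-layer ReLU network whose first-layer weights and bias (the representation) are shared across all tasks, while each task learns its own last-layer head. A minimal NumPy sketch of that forward pass follows; the dimensions, initialization scales, and variable names are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, W, b, a_t):
    """Two-layer ReLU prediction for one task: the first-layer weights W
    and bias b (the representation) are shared across tasks, while the
    head a_t is specific to task t."""
    return a_t @ relu(W @ x + b)

# Hypothetical sizes for illustration: input dim d, width m, T tasks.
d, m, T = 10, 32, 5
rng = np.random.default_rng(0)
W = rng.standard_normal((m, d)) / np.sqrt(d)      # shared representation
b = np.zeros(m)                                   # shared bias
heads = rng.standard_normal((T, m)) / np.sqrt(m)  # one head per task

x = rng.standard_normal(d)
preds = [forward(x, W, b, heads[t]) for t in range(T)]
```

Each task's prediction reuses the same `W` and `b`, which is what lets information from many tasks shape a single representation.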
Researcher Affiliation: Academia. "1 Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, Texas, USA; 2 Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, Pennsylvania, USA; 3 Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, California, USA."
Pseudocode: No. The paper describes the algorithm steps in paragraph form, but no formal pseudocode or algorithm block is provided.
Open Source Code: No. No explicit statement or link regarding the release of source code for the described methodology was found in the paper.
Open Datasets: No. "All experiments are conducted on synthetic data generated according to the model described in Section 2. To generate M, we sample each of its elements independently from the standard normal distribution, then orthonormalize its rows via a QR decomposition."
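The quoted data-generation step (i.i.d. standard normal entries, then row orthonormalization via QR) can be sketched directly in NumPy; the function name and dimensions are illustrative assumptions:

```python
import numpy as np

def sample_ground_truth_subspace(k, d, seed=0):
    """Sample a k x d matrix M with orthonormal rows, as in the quoted
    synthetic-data model: i.i.d. standard normal entries, then
    orthonormalization via a QR decomposition."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((k, d))
    # QR of the transpose gives orthonormal columns; transposing back
    # makes the rows of the returned matrix orthonormal.
    Q, _ = np.linalg.qr(M.T)
    return Q.T

M = sample_ground_truth_subspace(3, 10)
# Rows are orthonormal, so M @ M.T is the 3 x 3 identity.
print(np.allclose(M @ M.T, np.eye(3)))  # prints True
```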
Dataset Splits: No. The paper discusses training and testing, but there is no explicit mention or description of a separate validation set or of a splitting methodology for validation data.
Hardware Specification: No. No specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running experiments were mentioned in the paper.
Software Dependencies: No. The paper does not provide specific software dependency details with version numbers (e.g., libraries, frameworks, or solvers).
Experiment Setup: Yes. "Unless otherwise noted, we used λ_w = 0.05, λ_a = 0.5 and η = 0.1 (learning rate for both a_i and W), after tuning each parameter in the set {0.01, 0.05, 0.1, 0.5, 1}, separately for the single-task and multi-task cases, unless r = 4. We tuned ν_w ∈ {0.001, 0.01, 0.1, 1}, and used ν_w = 0.01 for r ≤ 3, unless otherwise noted. For r = 4, we found that setting λ_w = 0.1 and ν_w = 0.001 improved performance, but did not see improvement by changing the other parameters, so kept them the same. We used a smaller learning rate of 0.001 for the bias in all cases, although we reset the bias randomly before downstream evaluation. Then, we run gradient descent on the regularized empirical hinge loss on a fixed set of N samples to learn the last-layer head, i.e., linear probing. We run this gradient descent with step size η = 0.1 and ℓ2 regularizer λ̂_a = 0.01."
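The downstream evaluation quoted above is linear probing: gradient descent on an ℓ2-regularized empirical hinge loss over a fixed feature set, with step size 0.1 and regularizer 0.01. A self-contained sketch under those settings follows; the subgradient form, step count, and toy data are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def linear_probe(phi, y, eta=0.1, lam=0.01, steps=500):
    """Learn a linear head a on fixed features phi (N x m) by gradient
    descent on the regularized empirical hinge loss
        (1/N) * sum_i max(0, 1 - y_i * <a, phi_i>) + (lam/2) * ||a||^2,
    using the quoted step size eta = 0.1 and regularizer lam = 0.01."""
    N, m = phi.shape
    a = np.zeros(m)
    for _ in range(steps):
        margins = y * (phi @ a)
        active = (margins < 1.0).astype(float)        # hinge subgradient mask
        grad = -(phi.T @ (active * y)) / N + lam * a  # loss + ridge terms
        a -= eta * grad
    return a

# Toy usage on linearly separable features (hypothetical data, not the
# paper's synthetic model).
rng = np.random.default_rng(1)
phi = rng.standard_normal((200, 8))
a_star = rng.standard_normal(8)
y = np.sign(phi @ a_star)
a_hat = linear_probe(phi, y)
train_acc = np.mean(np.sign(phi @ a_hat) == y)
```

Only the head is trained here; the representation producing `phi` stays frozen, which is what distinguishes linear probing from full fine-tuning.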