Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks
Authors: Liam Collins, Hamed Hassani, Mahdi Soltanolkotabi, Aryan Mokhtari, Sanjay Shakkottai
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Figure 1 shows that gradient-based multi-task learning with a two-layer ReLU NN with first-layer parameters (the representation) shared among all tasks and last-layer weights (the head) learned uniquely for each task indeed recovers the ground-truth subspace with error diminishing with the number of tasks. We theoretically justify this observation, providing the first known proofs of multi-task feature learning with a nonlinear model along with a new explanation for why multi-tasking aids feature learning. Our theoretical contributions are summarized below, and verified numerically in Appendix F. ... In this section we verify our analysis with numerical simulations. We aim to both confirm that the alternating stochastic gradient descent algorithm for multi-task pretraining that we study recovers the ground-truth representation and further explore the mechanisms by which it does so. To this end, all experiments are conducted on synthetic data generated according to the model described in Section 2. |
| Researcher Affiliation | Academia | 1Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, Texas, USA 2Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, Pennsylvania, USA 3Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, California, USA. |
| Pseudocode | No | The paper describes the algorithm steps in paragraph form, but no formal pseudocode or algorithm block is provided. |
| Open Source Code | No | No explicit statement or link regarding the release of source code for the described methodology was found in the paper. |
| Open Datasets | No | All experiments are conducted on synthetic data generated according to the model described in Section 2. To generate M, we sample each of its elements independently from the standard normal distribution, then orthonormalize its rows via a QR decomposition. |
| Dataset Splits | No | The paper discusses training and testing, but there is no explicit mention or description of a separate validation set or splitting methodology for validation data. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running experiments were mentioned in the paper. |
| Software Dependencies | No | The paper does not provide specific software dependency details with version numbers (e.g., libraries, frameworks, or solvers). |
| Experiment Setup | Yes | Unless otherwise noted, we used λw = 0.05, λa = 0.5, and η = 0.1 (learning rate for both ai and W), after tuning each parameter in the set {0.01, 0.05, 0.1, 0.5, 1}, separately for the single-task and multi-task cases, unless r = 4. We tuned νw ∈ {0.001, 0.01, 0.1, 1}, and used νw = 0.01 for r ≤ 3, unless otherwise noted. For r = 4, we found that setting λw = 0.1 and νw = 0.001 improved performance, but did not see improvement by changing the other parameters, so kept them the same. We used a smaller learning rate of 0.001 for the bias in all cases, although we reset the bias randomly before downstream evaluation. Then, we run gradient descent on the regularized empirical hinge loss on a fixed set of N samples to learn the last-layer head, i.e., linear probing. We run this gradient descent with step size η = 0.1 and ℓ2 regularizer λ̂a = 0.01. |
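The two setup steps quoted above (orthonormalizing a randomly drawn ground-truth matrix M via QR, and linear probing by gradient descent on the ℓ2-regularized empirical hinge loss) can be sketched as follows. This is a minimal illustration, not the authors' code: the dimensions `d`, `k`, and `N`, the random first-layer matrix `W` standing in for the pretrained representation, and the label-generation rule are all assumptions for demonstration; only the QR orthonormalization, the step size 0.1, and the regularizer 0.01 come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth matrix M (k x d): sample each element independently from the
# standard normal distribution, then orthonormalize its rows via QR, as the
# paper describes. Dimensions are illustrative, not taken from the paper.
d, k = 20, 3
M = rng.standard_normal((k, d))
M = np.linalg.qr(M.T)[0].T  # rows of M are now orthonormal

# Stand-in features: phi(x) = ReLU(W x) with a random W, since the actual
# pretrained representation is not reproduced here.
N = 200
W = rng.standard_normal((k, d)) / np.sqrt(d)
X = rng.standard_normal((N, d))
feats = np.maximum(X @ W.T, 0.0)
a_true = rng.standard_normal(k)            # hypothetical planted head
y = np.sign(feats @ a_true + 1e-12)        # synthetic binary labels

# Linear probing: gradient descent on the l2-regularized empirical hinge
# loss, with step size eta = 0.1 and regularizer lam_a = 0.01 (paper values).
eta, lam_a = 0.1, 0.01
a = np.zeros(k)
for _ in range(500):
    margins = y * (feats @ a)
    active = margins < 1.0                 # points inside the hinge
    grad = -(y[active, None] * feats[active]).sum(0) / N + lam_a * a
    a -= eta * grad

train_acc = np.mean(np.sign(feats @ a) == y)
```

Because the labels here are generated by a planted head on the same features, the probe's training accuracy serves only as a sanity check that the hinge-loss descent is wired correctly, not as a reproduction of any reported number.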