How Fine-Tuning Allows for Effective Meta-Learning
Authors: Kurtland Chua, Qi Lei, Jason D. Lee
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a formal construction and an experimental verification of the gap in Section C (with the experiment described in Section C.3). Furthermore, we extend the linear hard case to a nonlinear setting in Section G. Our experiments only involve simulations of simple settings that do not require extensive compute. |
| Researcher Affiliation | Academia | Kurtland Chua, Princeton University, kchua@princeton.edu; Qi Lei, Princeton University, qilei@princeton.edu; Jason D. Lee, Princeton University, jasonlee@princeton.edu |
| Pseudocode | No | The paper describes algorithms like ADAPTREP and FROZENREP but does not present them in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | A Jupyter notebook is provided to run the simulation outlined in Section C.3. |
| Open Datasets | No | The paper does not use named public datasets; instead, it describes synthetic data generation for its theoretical analysis and simulations. |
| Dataset Splits | No | We do not use train-validation splits, as is widespread in practice. This is motivated by results in Bai et al. (2020), which show that data splitting may be undesirable, assuming realizability. |
| Hardware Specification | No | Our experiments only involve simulations of simple settings that do not require extensive compute. |
| Software Dependencies | No | The paper mentions a Jupyter notebook for running the simulations but does not specify software dependencies or versions. |
| Experiment Setup | Yes | Source training. We consider the following regularized form of (1): $\min_B \min_{\Delta_t, w_t} \frac{1}{2n} \sum_{t=1}^{T} \| y_t - X_t (B + \Delta_t) w_t \|_2^2 + \frac{\lambda}{2} \|\Delta_t\|_F^2 + \frac{\gamma}{2} \|w_t\|_2^2$. In Section B, we show that the regularization is equivalent to regularizing $\sqrt{\lambda\gamma}\,\|\Delta_t w_t\|_2$, consistent with the intuition that $\Delta_t w_t$ has small norm. This additional regularization is necessary, since (1) only controls the norm of $\Delta_t$, which is insufficient for controlling $\Delta_t w_t$. Target training. Let $B_0$ be the output of (4) after orthonormalizing. We adapt to the target task via $L_\beta(\Delta, w) = \frac{1}{2n} \| y - \beta X (A_{B_0} + \Delta)(w_0 + w) \|_2^2$ (5), where $A_{B_0} := [B_0 \; B_0] \in \mathbb{R}^{d \times 2k}$ and $w_0 = [u, u]$ for a fixed unit-norm vector $u \in \mathbb{R}^k$. This corresponds to training a predictor of the form $x \mapsto \langle x, (A_{B_0} + \Delta)(w_0 + w) \rangle$. We optimize (5) by performing $T_{\mathrm{PGD}}$ steps of PGD with stepsize $\eta$ on (5) with $C_\beta := \{(\Delta, w) \mid \|\Delta\|_F \le c_1/\beta,\; \|w\|_2 \le c_2/\beta\}$ as the feasible set, where we explicitly define $c_1$ and $c_2$ in Section B. In Section C.3, it also states: |
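
To make the quoted source-training objective concrete, here is a minimal NumPy sketch of the regularized loss from Eq. (1) above. The function name, argument layout, and shapes are our own assumptions for illustration; they are not taken from the paper or its notebook.

```python
import numpy as np

def source_objective(B, deltas, ws, Xs, ys, lam, gamma):
    """Sketch of the regularized source-training objective quoted above.

    B      : (d, k) shared representation.
    deltas : list of (d, k) per-task perturbations Delta_t.
    ws     : list of (k,) per-task heads w_t.
    Xs, ys : per-task designs (n, d) and labels (n,).
    lam, gamma : regularization strengths for Delta_t and w_t.
    """
    n = Xs[0].shape[0]
    total = 0.0
    for X_t, y_t, D_t, w_t in zip(Xs, ys, deltas, ws):
        resid = y_t - X_t @ (B + D_t) @ w_t
        total += 0.5 / n * resid @ resid              # (1/2n)||y_t - X_t(B+Delta_t)w_t||^2
        total += 0.5 * lam * np.sum(D_t ** 2)         # (lambda/2)||Delta_t||_F^2
        total += 0.5 * gamma * w_t @ w_t              # (gamma/2)||w_t||_2^2
    return total
```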
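
The target-training step can be sketched the same way. The example below runs projected gradient descent on the quoted loss $L_\beta$ (Eq. (5)) with the feasible set $C_\beta$. The gradient expressions and the placement of $\beta$ follow our reading of the quoted formula, and all helper names and shapes are hypothetical rather than the authors' implementation.

```python
import numpy as np

def project(Delta, w, c1, c2, beta):
    """Project (Delta, w) onto C_beta = {||Delta||_F <= c1/beta, ||w||_2 <= c2/beta}."""
    fro = np.linalg.norm(Delta)
    if fro > c1 / beta:
        Delta = Delta * (c1 / beta) / fro
    nrm = np.linalg.norm(w)
    if nrm > c2 / beta:
        w = w * (c2 / beta) / nrm
    return Delta, w

def adapt_target(X, y, B0, u, beta, eta, T_pgd, c1, c2):
    """PGD sketch for L_beta(Delta, w) = (1/2n)||y - beta*X(A_B0+Delta)(w0+w)||^2,
    with A_B0 = [B0 B0] in R^{d x 2k} and w0 = [u, u]."""
    n, d = X.shape
    k = B0.shape[1]
    A_B0 = np.concatenate([B0, B0], axis=1)          # (d, 2k)
    w0 = np.concatenate([u, u])                      # (2k,)
    Delta = np.zeros((d, 2 * k))
    w = np.zeros(2 * k)
    for _ in range(T_pgd):
        resid = beta * X @ (A_B0 + Delta) @ (w0 + w) - y     # (n,)
        g = beta / n * X.T @ resid                           # shared gradient factor, (d,)
        grad_Delta = np.outer(g, w0 + w)                     # dL/dDelta, (d, 2k)
        grad_w = (A_B0 + Delta).T @ g                        # dL/dw, (2k,)
        Delta, w = project(Delta - eta * grad_Delta,
                           w - eta * grad_w, c1, c2, beta)
    return Delta, w
```

The projection simply rescales $\Delta$ and $w$ independently onto their norm balls, which matches the product-set form of $C_\beta$ quoted above; the constants $c_1$, $c_2$ would come from the paper's Section B.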