How Fine-Tuning Allows for Effective Meta-Learning

Authors: Kurtland Chua, Qi Lei, Jason D. Lee

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide a formal construction and an experimental verification of the gap in Section C (the experimental verification appears in Section C.3). Furthermore, we extend the linear hard case to a nonlinear setting in Section G. Our experiments only involve simulations of simple settings that do not require extensive compute.
Researcher Affiliation | Academia | Kurtland Chua, Princeton University (kchua@princeton.edu); Qi Lei, Princeton University (qilei@princeton.edu); Jason D. Lee, Princeton University (jasonlee@princeton.edu)
Pseudocode | No | The paper describes algorithms such as ADAPTREP and FROZENREP but does not present them in a structured pseudocode or algorithm-block format.
Open Source Code | Yes | A Jupyter notebook is provided to run the simulation outlined in Section C.3.
Open Datasets | No | The paper does not use named public datasets; instead, it describes synthetic data generation for its theoretical analysis and simulations.
Dataset Splits | No | We do not use train-validation splits, as is widespread in practice. This is motivated by results in Bai et al. (2020), which show that data splitting may be undesirable, assuming realizability.
Hardware Specification | No | Our experiments only involve simulations of simple settings that do not require extensive compute.
Software Dependencies | No | The paper mentions a Jupyter notebook for its simulations but does not list specific software dependencies or version requirements.
Experiment Setup | Yes | Source training. We consider the following regularized form of (1):

$$\min_{B} \min_{\{\Delta_t, w_t\}} \frac{1}{2n} \sum_{t=1}^{T} \left\| y_t - X_t (B + \Delta_t) w_t \right\|_2^2 + \frac{\lambda}{2} \left\| \Delta_t \right\|_F^2 + \frac{\gamma}{2} \left\| w_t \right\|_2^2.$$

In Section B, we show that the regularization is equivalent to regularizing $\sqrt{\lambda\gamma}\,\|\Delta_t w_t\|_2$, consistent with the intuition that $\Delta_t w_t$ has small norm. This additional regularization is necessary, since (1) only controls the norm of $\Delta_t$, which is insufficient for controlling $\Delta_t w_t$.

Target training. Let $B_0$ be the output of (4) after orthonormalizing. We adapt to the target task via

$$L_\beta(\Delta, w) = \frac{1}{2n} \left\| y - \beta X (A_{B_0} + \Delta)(w_0 + w) \right\|_2^2, \qquad (5)$$

where $A_{B_0} := [B_0 \;\; -B_0] \in \mathbb{R}^{d \times 2k}$ and $w_0 = [u, u]$ for a fixed unit-norm vector $u \in \mathbb{R}^k$. This corresponds to training a predictor of the form $x \mapsto \langle x, (A_{B_0} + \Delta)(w_0 + w) \rangle$. We optimize (5) by performing $T_{\mathrm{PGD}}$ steps of PGD with stepsize $\eta$, using $C_\beta := \{(\Delta, w) \mid \|\Delta\|_F \le c_1/\beta,\ \|w\|_2 \le c_2/\beta\}$ as the feasible set, where we explicitly define $c_1$ and $c_2$ in Section B.
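One way to see the equivalence claimed for the source regularizer: for a fixed product $v = \Delta_t w_t$, the minimum-Frobenius-norm $\Delta_t$ satisfying $\Delta_t w_t = v$ has $\|\Delta_t\|_F = \|v\|_2 / \|w_t\|_2$, so the penalty reduces to $\frac{\lambda}{2} \|v\|_2^2 / \|w_t\|_2^2 + \frac{\gamma}{2} \|w_t\|_2^2$, which is minimized at $\sqrt{\lambda\gamma}\,\|v\|_2$ by AM-GM.

To make the target-training step concrete, below is a minimal NumPy sketch of the PGD adaptation in (5), written from the description above; it is an illustration, not the authors' released notebook. The constants c1, c2, the stepsize eta, the iteration count, and the toy data are all placeholder values (the paper defines c1 and c2 in its Section B), and the sketch assumes the sign pattern A_{B0} = [B0, -B0] (minus signs appear to have been lost in extraction), under which the initial predictor is exactly zero.

```python
# Minimal sketch (not the authors' code) of target-task adaptation via
# projected gradient descent on L_beta(Delta, w) from Eq. (5).
import numpy as np

def project_to_ball(v, radius, norm):
    """Project onto {v : norm(v) <= radius} by rescaling; exact for
    Euclidean and Frobenius norm balls."""
    size = norm(v)
    return v if size <= radius else v * (radius / size)

def adapt_target(X, y, B0, u, beta, c1, c2, eta, T_pgd):
    """Run T_pgd PGD steps on Eq. (5) and return the learned linear
    predictor theta = (A_{B0} + Delta)(w0 + w)."""
    n, d = X.shape
    k = B0.shape[1]
    A = np.hstack([B0, -B0])        # assumed A_{B0} = [B0, -B0] in R^{d x 2k}
    w0 = np.concatenate([u, u])     # w0 = [u, u]; note A @ w0 = 0 at init
    Delta = np.zeros((d, 2 * k))
    w = np.zeros(2 * k)
    for _ in range(T_pgd):
        v = w0 + w
        resid = beta * X @ ((A + Delta) @ v) - y      # prediction residual
        g = (beta / n) * (X.T @ resid)                # shared gradient factor
        grad_Delta = np.outer(g, v)                   # dL/dDelta
        grad_w = (A + Delta).T @ g                    # dL/dw
        # Gradient steps followed by projection onto the feasible set C_beta.
        Delta = project_to_ball(Delta - eta * grad_Delta, c1 / beta,
                                lambda M: np.linalg.norm(M, "fro"))
        w = project_to_ball(w - eta * grad_w, c2 / beta, np.linalg.norm)
    return (A + Delta) @ (w0 + w)

# Toy usage on synthetic data; every value here is illustrative.
rng = np.random.default_rng(0)
d, k, n = 20, 3, 50
B0, _ = np.linalg.qr(rng.standard_normal((d, k)))     # orthonormal columns
u = np.eye(k)[0]                                      # fixed unit-norm vector
theta_star = B0 @ rng.standard_normal(k)              # target in span(B0)
X = rng.standard_normal((n, d))
y = X @ theta_star + 0.01 * rng.standard_normal(n)
theta_hat = adapt_target(X, y, B0, u, beta=1.0, c1=5.0, c2=5.0,
                         eta=0.02, T_pgd=2000)
print("recovery error:", np.linalg.norm(theta_hat - theta_star))
```

Because A @ w0 = 0 under the assumed sign pattern, training starts from the zero predictor, and the rescaling projection keeps every iterate inside C_beta.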