M-L2O: Towards Generalizable Learning-to-Optimize by Test-Time Fast Self-Adaptation

Authors: Junjie Yang, Xuxi Chen, Tianlong Chen, Zhangyang Wang, Yingbin Liang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirical observations on several classic tasks like LASSO, Quadratic and Rosenbrock demonstrate that M-L2O converges significantly faster than vanilla L2O with only 5 steps of adaptation, echoing our theoretical results. Codes are available in https://github.com/VITA-Group/M-L2O." and "5 EXPERIMENTS In this section, we provide a comprehensive description of the experimental settings and present the results we obtained. Our findings demonstrate a high degree of consistency between the empirical observations and the theoretical outcomes." (A hedged sketch of the few-step adaptation protocol follows the table.)
Researcher Affiliation | Academia | Junjie Yang (1), Xuxi Chen (2), Tianlong Chen (2), Zhangyang Wang (2), Yingbin Liang (1); (1) The Ohio State University, (2) University of Texas at Austin
Pseudocode | Yes | "Algorithm 1 Our Proposed M-L2O."
Open Source Code | Yes | "Codes are available in https://github.com/VITA-Group/M-L2O."
Open Datasets | Yes | "Optimizees. We conduct experiments on three distinct optimizees, namely LASSO, Quadratic, and Rosenbrock (Rosenbrock, 1960). The formulation of the Quadratic problem is min_x ½‖Ax − b‖² and the formulation of the LASSO problem is min_x ½‖Ax − b‖² + λ‖x‖₁, where A ∈ ℝ^(d×d), b ∈ ℝ^d. We set λ = 0.005. The precise formulation of the Rosenbrock problem is available in Section A.6. During the meta-training and testing stage, the optimizees ξ_train and ξ_test are drawn from the pre-specified distributions D_train and D_test, respectively. Similarly, the optimizees ξ_adapt used during adaptation are sampled from the distribution D_adapt." (The three objectives are written out in the first sketch after the table.)
Dataset Splits | No | The paper mentions 'training', 'adaptation', and 'testing' optimizees/tasks but does not explicitly provide details on a separate validation dataset split.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory, or cloud instance types used for running experiments.
Software Dependencies | No | The paper mentions using a 'single-layer LSTM network' and 'Adam' optimizer but does not specify version numbers for any software dependencies or libraries.
Experiment Setup | Yes | "For all our experiments, we use a single-layer LSTM network with 20 hidden units as the backbone. We adopt the methodology proposed by Lv et al. (2017) and Chen et al. (2020a) to utilize the parameters' gradients and their corresponding normalized momentum to construct the observation vectors. [...] For all experiments, we set the number of optimizee iterations, denoted by T, to 20 when meta-training the L2O optimizers and adapting to optimizees. [...] The value of the total epochs, denoted by K, is set to 5000, and we adopt the curriculum learning technique (Chen et al., 2020a) to dynamically adjust the number of epochs per task, denoted by S. To update the weights of the optimizers (ϕ), we use Adam (Kingma & Ba, 2014) with a fixed learning rate of 1 × 10^−4." (A meta-training sketch reflecting this setup follows the table.)
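
The three optimizee objectives quoted in the Open Datasets row can be written out as a minimal NumPy sketch, assuming a problem dimension d and Gaussian problem data (A, b); only λ = 0.005 is taken from the paper, and the Rosenbrock form below is the classic one (the paper's exact variant is in its Section A.6).

```python
# Minimal sketch of the three optimizees; only lambda = 0.005 comes from the paper.
import numpy as np

d = 10                                  # assumed dimension for illustration
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))         # assumed sampling of the problem data
b = rng.standard_normal(d)
lam = 0.005                             # lambda as stated in the paper

def quadratic(x):
    """Quadratic optimizee: (1/2) * ||Ax - b||^2."""
    r = A @ x - b
    return 0.5 * r @ r

def lasso(x):
    """LASSO optimizee: (1/2) * ||Ax - b||^2 + lambda * ||x||_1."""
    return quadratic(x) + lam * np.abs(x).sum()

def rosenbrock(x, a=1.0, c=100.0):
    """Classic Rosenbrock function (assumed standard form; see the paper's Section A.6)."""
    return np.sum(c * (x[1:] - x[:-1] ** 2) ** 2 + (a - x[:-1]) ** 2)

print(quadratic(np.zeros(d)), lasso(np.zeros(d)), rosenbrock(np.zeros(d)))
```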
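
The Experiment Setup quote outlines the meta-training loop. Below is a hedged PyTorch sketch, assuming a coordinate-wise single-layer LSTM with 20 hidden units whose observation pairs the gradient with a normalized momentum, an unroll of T = 20 optimizee steps, and Adam at 1 × 10^−4 on the optimizer weights ϕ. The output head, momentum decay, normalization, and summed meta-loss are illustrative assumptions rather than the exact choices of Lv et al. (2017) or Chen et al. (2020a), and `sample_quadratic` is a hypothetical sampler.

```python
# Hedged sketch of the quoted meta-training setup; observation features,
# update scaling, and the summed meta-loss are illustrative assumptions.
import torch
import torch.nn as nn

class LSTMOptimizer(nn.Module):
    """Coordinate-wise learned optimizer: single-layer LSTM with 20 hidden units."""
    def __init__(self, hidden=20):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden)  # [gradient, normalized momentum]
        self.head = nn.Linear(hidden, 1)                        # per-coordinate update

    def forward(self, grad, momentum, state):
        obs = torch.stack([grad, momentum], dim=-1).unsqueeze(0)  # (1, d, 2): coordinates as batch
        out, state = self.lstm(obs, state)
        return self.head(out).squeeze(0).squeeze(-1), state

def meta_train_epoch(optimizer_net, sample_optimizee, meta_opt, T=20, beta=0.9):
    """One unrolled meta-update: roll the learned optimizer out for T optimizee
    steps and backpropagate the accumulated loss into the optimizer weights."""
    loss_fn, x = sample_optimizee()                 # fresh optimizee and initial point
    momentum = torch.zeros_like(x)
    state, meta_loss = None, 0.0
    for _ in range(T):
        loss = loss_fn(x)
        meta_loss = meta_loss + loss
        (grad,) = torch.autograd.grad(loss, x, create_graph=True)
        momentum = beta * momentum + (1 - beta) * grad
        norm_mom = momentum / (momentum.norm() + 1e-8)          # assumed normalization
        update, state = optimizer_net(grad, norm_mom, state)
        x = x + update
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()

def sample_quadratic(d=10):
    """Hypothetical optimizee sampler: a random Quadratic problem."""
    A, b = torch.randn(d, d), torch.randn(d)
    x0 = torch.zeros(d, requires_grad=True)
    return (lambda x: 0.5 * (A @ x - b).pow(2).sum()), x0

net = LSTMOptimizer()
meta_opt = torch.optim.Adam(net.parameters(), lr=1e-4)  # learning rate quoted above
meta_train_epoch(net, sample_quadratic, meta_opt)
```

The curriculum schedule over epochs per task (S) and the total epoch count K = 5000 from the quote are omitted here for brevity.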
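
The Research Type row reports that M-L2O converges faster than vanilla L2O after only 5 adaptation steps. One plausible reading of the test-time self-adaptation protocol, sketched below under stated assumptions: copy the meta-trained weights, run a handful of unrolled meta-updates on optimizees drawn from D_adapt, then deploy the adapted optimizer on D_test. `self_adapt` and `sample_adapt` are hypothetical names, and the paper's exact procedure is its Algorithm 1 ("Our Proposed M-L2O").

```python
# Hedged sketch of few-step test-time self-adaptation; not the paper's exact
# Algorithm 1, only an illustration of the 5-step adaptation quoted above.
import copy
import torch

def self_adapt(meta_trained_net, meta_update_step, sample_adapt, steps=5, lr=1e-4):
    """Adapt a copy of the meta-trained L2O optimizer with a few meta-updates
    on optimizees drawn from the adaptation distribution D_adapt.

    meta_update_step(net, sampler, opt) performs one unrolled meta-update,
    e.g. meta_train_epoch from the meta-training sketch above."""
    adapted = copy.deepcopy(meta_trained_net)
    opt = torch.optim.Adam(adapted.parameters(), lr=lr)
    for _ in range(steps):
        meta_update_step(adapted, sample_adapt, opt)
    return adapted

# Example usage with names from the previous sketch:
# adapted_net = self_adapt(net, meta_train_epoch, sample_quadratic, steps=5)
```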