M-L2O: Towards Generalizable Learning-to-Optimize by Test-Time Fast Self-Adaptation
Authors: Junjie Yang, Xuxi Chen, Tianlong Chen, Zhangyang Wang, Yingbin Liang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Empirical observations on several classic tasks like LASSO, Quadratic and Rosenbrock demonstrate that M-L2O converges significantly faster than vanilla L2O with only 5 steps of adaptation, echoing our theoretical results. Codes are available in https://github.com/VITA-Group/M-L2O." and "5 EXPERIMENTS: In this section, we provide a comprehensive description of the experimental settings and present the results we obtained. Our findings demonstrate a high degree of consistency between the empirical observations and the theoretical outcomes." |
| Researcher Affiliation | Academia | Junjie Yang¹, Xuxi Chen², Tianlong Chen², Zhangyang Wang², Yingbin Liang¹; ¹The Ohio State University, ²University of Texas at Austin |
| Pseudocode | Yes | Algorithm 1 Our Proposed M-L2O. |
| Open Source Code | Yes | Codes are available in https://github.com/VITA-Group/M-L2O. |
| Open Datasets | Yes | Optimizees. We conduct experiments on three distinct optimizees, namely LASSO, Quadratic, and Rosenbrock (Rosenbrock, 1960). The formulation of the Quadratic problem is $\min_x \frac{1}{2}\|Ax - b\|^2$ and the formulation of the LASSO problem is $\min_x \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$, where $A \in \mathbb{R}^{d \times d}$, $b \in \mathbb{R}^d$. We set $\lambda = 0.005$. The precise formulation of the Rosenbrock problem is available in Section A.6. During the meta-training and testing stage, the optimizees $\xi_{\text{train}}$ and $\xi_{\text{test}}$ are drawn from the pre-specified distributions $D_{\text{train}}$ and $D_{\text{test}}$, respectively. Similarly, the optimizees $\xi_{\text{adapt}}$ used during adaptation are sampled from the distribution $D_{\text{adapt}}$. (A minimal code sketch of these objectives appears after the table.) |
| Dataset Splits | No | The paper mentions 'training', 'adaptation', and 'testing' optimizees/tasks but does not explicitly provide details on a separate validation dataset split. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper mentions using a 'single-layer LSTM network' and 'Adam' optimizer but does not specify version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | For all our experiments, we use a single-layer LSTM network with 20 hidden units as the backbone. We adopt the methodology proposed by Lv et al. (2017) and Chen et al. (2020a) to utilize the parameters' gradients and their corresponding normalized momentum to construct the observation vectors. [...] For all experiments, we set the number of optimizee iterations, denoted by T, to 20 when meta-training the L2O optimizers and adapting to optimizees. [...] The value of the total epochs, denoted by K, is set to 5000, and we adopt the curriculum learning technique (Chen et al., 2020a) to dynamically adjust the number of epochs per task, denoted by S. To update the weights of the optimizers (ϕ), we use Adam (Kingma & Ba, 2014) with a fixed learning rate of 1 × 10^−4. (A hedged sketch of this setup follows the table.) |
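
The Quadratic and LASSO objectives quoted in the Open Datasets row are simple enough to state in code. Below is a minimal sketch, not the authors' implementation: the function names, the way `A` and `b` are sampled, and the problem size `d` are assumptions for illustration only; the λ = 0.005 value matches the paper's reported setting.

```python
# Minimal sketch of the Quadratic and LASSO optimizee objectives (assumed NumPy forms).
import numpy as np

def quadratic_loss(x, A, b):
    # Quadratic optimizee: (1/2) * ||Ax - b||^2
    r = A @ x - b
    return 0.5 * np.dot(r, r)

def lasso_loss(x, A, b, lam=0.005):
    # LASSO optimizee: (1/2) * ||Ax - b||^2 + lambda * ||x||_1, with lambda = 0.005 per the paper
    return quadratic_loss(x, A, b) + lam * np.sum(np.abs(x))

# Hypothetical usage: sample one optimizee instance and evaluate both losses at x = 0.
d = 50  # problem size is an assumption, not taken from the paper
rng = np.random.default_rng(0)
A, b = rng.standard_normal((d, d)), rng.standard_normal(d)
x0 = np.zeros(d)
print(quadratic_loss(x0, A, b), lasso_loss(x0, A, b))
```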
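
The Experiment Setup row describes a single-layer LSTM optimizer with 20 hidden units, observation vectors built from per-coordinate gradients and normalized momentum, an unroll length of T = 20, K = 5000 meta-training epochs, and Adam with learning rate 1e-4 on the optimizer weights ϕ. The PyTorch sketch below only illustrates that setup under stated assumptions: the class and function names, the momentum constant, the loss accumulation, and the fixed Quadratic optimizee are hypothetical, and the paper's curriculum learning schedule (the per-task epoch count S) is omitted.

```python
# Hedged sketch of the reported L2O setup; not the authors' code from the VITA-Group repo.
import torch
import torch.nn as nn

class LSTMOptimizer(nn.Module):
    """Single-layer LSTM optimizer with 20 hidden units (as reported in the paper)."""
    def __init__(self, hidden_size=20):
        super().__init__()
        # Observation per coordinate: [gradient, normalized momentum] -> 2 input features.
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden_size, num_layers=1)
        self.head = nn.Linear(hidden_size, 1)  # predicted update per coordinate

    def forward(self, obs, state):
        # obs: (1, num_coords, 2); coordinates are treated as the LSTM batch dimension.
        out, state = self.lstm(obs, state)
        return self.head(out).squeeze(0).squeeze(-1), state

def unroll(optimizer_net, loss_fn, x, T=20, beta=0.9):
    """Unroll the learned optimizer for T optimizee steps and accumulate the losses."""
    state, momentum, meta_loss = None, torch.zeros_like(x), 0.0
    for _ in range(T):
        loss = loss_fn(x)
        meta_loss = meta_loss + loss
        (grad,) = torch.autograd.grad(loss, x, create_graph=True)
        momentum = beta * momentum + (1 - beta) * grad       # beta is an assumption
        norm_mom = momentum / (momentum.norm() + 1e-8)       # normalized momentum feature
        obs = torch.stack([grad, norm_mom], dim=-1).unsqueeze(0)
        update, state = optimizer_net(obs, state)
        x = x + update
    return meta_loss

# Meta-training skeleton: K = 5000 epochs, Adam with lr 1e-4 on the optimizer weights.
optimizer_net = LSTMOptimizer()
meta_opt = torch.optim.Adam(optimizer_net.parameters(), lr=1e-4)
d = 50
A, b = torch.randn(d, d), torch.randn(d)                     # one sampled Quadratic optimizee
loss_fn = lambda x: 0.5 * (A @ x - b).pow(2).sum()
for epoch in range(5000):
    x = torch.zeros(d, requires_grad=True)
    meta_loss = unroll(optimizer_net, loss_fn, x, T=20)
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
```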