Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery

Authors: Renpu Liu, Ruida Zhou, Cong Shen, Jing Yang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility assessment (variable, result, and supporting LLM response):

Research Type: Experimental
"We compare the ICL performance of Transformers with traditional iterative algorithms and L2O algorithms for sparse recovery empirically. Our experimental results indicate that Transformers substantially outperform traditional gradient-descent-based iterative algorithms, and achieve performance comparable to L2O algorithms that are trained and tested on data generated with the same measurement matrix. This supports our claim that Transformers can implement L2O algorithms during ICL. Moreover, Transformers demonstrate remarkable generalization capability when the measurement matrix varies, and achieve accelerated convergence when additional structure is imposed on the underlying sparse vectors, supporting our theoretical findings."

Researcher Affiliation: Academia
Renpu Liu¹, Ruida Zhou², Cong Shen¹, Jing Yang¹ (¹University of Virginia, ²University of California, Los Angeles)

Pseudocode: No
The paper describes the Transformer architecture and the LISTA-type algorithm in text and mathematical equations. It contains no section labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format.

Open Source Code: No
The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository.

Open Datasets: No
The paper describes generating synthetic data for its experiments: 'we sample a ground-truth sparse vector β from a d = 20-dimensional standard normal distribution, and we fix the sparsity of β to be 3 by randomly setting 17 entries of β to zero. Next, we independently sample N = 10 vectors from a d-dimensional standard normal distribution and construct the measurement matrix X ∈ R^{10×20} (each sampled d-dimensional random vector is a row of X). We also sample an additional x_{N+1} from the d-dimensional standard Gaussian distribution.' It does not refer to or provide access to any pre-existing public dataset.

Dataset Splits: No
The paper describes how instances are generated for training and inference, but it does not specify explicit train/validation/test splits of a fixed dataset. Data instances are generated randomly per epoch: 'For each training epoch, we create I = 50,000 instances from 50,000 randomly generated sparse vectors.' and 'randomly sample 64 instances per epoch and train the algorithms for 10^6 epochs.'

Hardware Specification: Yes
"We run the experiments for Small TF and GPT-2 on an NVIDIA RTX A5000 GPU with 24 GB of memory."

Software Dependencies: No
The paper does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments.

Experiment Setup: Yes
"For all of these algorithms, we set the number of iterations K = 12. We generate a single fixed measurement matrix X. For each training epoch, we create I = 50,000 instances from 50,000 randomly generated sparse vectors. Small TF has 12 layers, each containing a self-attention layer followed by an MLP layer. Each self-attention layer has 4 attention heads. We set the embedding dimension to D = 42, with the embedding structure given by Equation (4.2). For GPT-2, we employ 12 layers, set the embedding dimension to 256, and use 8 attention heads per layer. ... we randomly generate 64 instances per epoch and train the algorithms for 10^6 epochs. The training process minimizes the label prediction loss Σ_{j=1}^{N+1} (y_j − ŷ_j)²."
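The synthetic data generation and training loss quoted in the table can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the report does not quote a noise model, so the labels are assumed to be noiseless linear measurements y_j = ⟨x_j, β⟩, and all function names here are hypothetical.

```python
import numpy as np


def make_instance(d=20, N=10, sparsity=3, rng=None):
    """Generate one in-context sparse-recovery instance per the quoted setup.

    Assumption (not stated in the report): labels are noiseless linear
    measurements y_j = <x_j, beta>.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Ground-truth vector from a d-dimensional standard normal,
    # with d - sparsity (= 17) entries zeroed at random.
    beta = rng.standard_normal(d)
    zero_idx = rng.choice(d, size=d - sparsity, replace=False)
    beta[zero_idx] = 0.0
    # Measurement matrix X in R^{N x d}: each row is an i.i.d. standard
    # normal vector; plus one additional query point x_{N+1}.
    X = rng.standard_normal((N, d))
    x_query = rng.standard_normal(d)
    y = X @ beta                 # assumed noiseless measurements
    y_query = float(x_query @ beta)
    return X, y, x_query, y_query, beta


def label_prediction_loss(y_true, y_pred):
    """Label prediction loss: sum_{j=1}^{N+1} (y_j - yhat_j)^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum((y_true - y_pred) ** 2))
```

A training epoch as described would draw a fresh batch of such instances (e.g., 50,000 for the iterative baselines or 64 for the Transformers) and minimize `label_prediction_loss` over the model's predictions on all N + 1 labels.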