Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
Unlabeled Data Can Provably Enhance In-Context Learning of Transformers
Authors: Renpu Liu, Jing Yang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that the augmented ICL framework consistently outperforms conventional few-shot ICL, providing empirical support for our theoretical findings. To the best of our knowledge, this is the first theoretical study on the impact of unlabeled data on the ICL performance of transformers. |
| Researcher Affiliation | Academia | Renpu Liu University of Virginia Charlottesville, VA 22903 EMAIL Jing Yang University of Virginia Charlottesville, VA 22903 EMAIL |
| Pseudocode | No | The paper describes methods using mathematical formulations and prose, but it does not contain any clearly labeled pseudocode blocks or algorithms in a structured format. |
| Open Source Code | Yes | 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide the code in the supplemental material. |
| Open Datasets | No | Problem setup. In the following experiments, the augmented ICL instances are generated as follows. We set the number of classes $C = 3$ and the data dimension $d = 3$. The class mean vectors $\{\mu_i\}_{i=1}^{C}$ are randomly sampled from a $d$-dimensional standard normal distribution. The covariance matrix $\Sigma = \epsilon I_d$ is shared across classes, where $I_d$ is the $d$-dimensional identity matrix. We set $\epsilon \in \{0.7, 1.5\}$. Each instance contains $N = 5$ labeled data points and $M$ unlabeled data points, where $M \in \{1, 10, 20\}$. The $M = 1$ case recovers the conventional ICL setting. |
| Dataset Splits | No | During training, in each iteration, we randomly generate 64 augmented ICL instances, and perform one gradient descent (GD) on the average empirical CoT training loss defined in Equation (5.1) over the batch. In total, we perform 15,000 GD iterations during training. For evaluation, we randomly generated 100 augmented ICL instances, and obtained the corresponding class mean estimates from the trained transformer through CoT prompting. |
| Hardware Specification | Yes | Compute resources. All experiments are conducted on an NVIDIA H100 GPU with 80 GB of memory. The experiments require roughly five hours to complete. |
| Software Dependencies | No | The paper discusses various methodologies and components like transformers, attention mechanisms, and MLPs, but does not specify any software libraries or their version numbers (e.g., PyTorch, TensorFlow, Python versions) used for implementation. |
| Experiment Setup | Yes | Problem setup. In the following experiments, the augmented ICL instances are generated as follows. We set the number of classes $C = 3$ and the data dimension $d = 3$. The class mean vectors $\{\mu_i\}_{i=1}^{C}$ are randomly sampled from a $d$-dimensional standard normal distribution. The covariance matrix $\Sigma = \epsilon I_d$ is shared across classes, where $I_d$ is the $d$-dimensional identity matrix. We set $\epsilon \in \{0.7, 1.5\}$. Each instance contains $N = 5$ labeled data points and $M$ unlabeled data points, where $M \in \{1, 10, 20\}$. The $M = 1$ case recovers the conventional ICL setting. Transformer structure. We construct a transformer with the architecture specified in Theorem 4.1. This model features 4 layers, with each layer composed of an attention module followed by an MLP module. Activation functions for the attention layers are configured as follows: softmax for the first layer, linear for the second and third layers, and ReLU for the fourth layer. We set $d_p = 16$, and the number of CoT steps $T = 5$. During training, in each iteration, we randomly generate 64 augmented ICL instances, and perform one gradient descent (GD) on the average empirical CoT training loss defined in Equation (5.1) over the batch. In total, we perform 15,000 GD iterations during training. |
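
The problem setup quoted in the table can be illustrated with a short data-generation sketch. This is not the authors' code (which is provided only in the paper's supplemental material); `make_instance` and `make_batch` are hypothetical helper names. The quoted setup does not say how class labels are assigned, so the sketch assumes labels are drawn uniformly at random for both labeled and unlabeled points, and that the unlabeled points' latent labels are simply withheld from the prompt.

```python
import math
import random


def sample_point(rng, mean, std):
    """Draw one point from N(mean, std^2 * I_d)."""
    return [m + std * rng.gauss(0.0, 1.0) for m in mean]


def make_instance(rng, C=3, d=3, eps=0.7, N=5, M=10):
    """Sample one augmented ICL instance (hypothetical helper)."""
    # Class means are drawn from a d-dimensional standard normal.
    means = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(C)]
    # Sigma = eps * I_d, so each coordinate has standard deviation sqrt(eps).
    std = math.sqrt(eps)

    # Assumption: labels are uniform over the C classes.
    y_lab = [rng.randrange(C) for _ in range(N)]
    x_lab = [sample_point(rng, means[y], std) for y in y_lab]

    # Unlabeled points come from the same mixture; their latent labels
    # are kept here only for inspection and would not enter the prompt.
    y_unl = [rng.randrange(C) for _ in range(M)]
    x_unl = [sample_point(rng, means[y], std) for y in y_unl]

    return {"means": means, "x_lab": x_lab, "y_lab": y_lab,
            "x_unl": x_unl, "y_unl_hidden": y_unl}


def make_batch(rng, batch_size=64, **kwargs):
    """Sample a batch of independent instances (64 per iteration in the setup)."""
    return [make_instance(rng, **kwargs) for _ in range(batch_size)]


rng = random.Random(0)
batch = make_batch(rng, eps=1.5, M=20)
print(len(batch), len(batch[0]["x_lab"]), len(batch[0]["x_unl"]))
```

Each instance draws fresh class means, so a model trained on such batches cannot memorize a fixed set of means and has to estimate them from the points in context.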