Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Prompt Tuning Transformers for Data Memorization

Authors: Haiyu Wang, Yuanyuan Lin

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we provide both theoretical and empirical analyses of data memorization ability of prompt-tuned Transformers. Building on recent theoretical frameworks, we derive an upper bound on the required prompt length for exact memorization of finite datasets and establish a trade-off between prompt length and the number of autoregressive generation steps. Specifically, we show that a constant-size Transformer can memorize n input-output pairs with prompts of length O( n N), where N denotes the sequence length. Empirical results further demonstrate that prompt-tuned, randomly initialized Transformers are able to effectively memorize finite datasets. These models also capture the intrinsic low-rank structure of the data, leading to a reduction in the required prompt length. Finally, we analyze how the initialization of the Transformer backbone affects the performance of prompt tuning. Our findings provide new insights into the expressivity, efficiency, and underlying mechanisms of prompt tuning, bridging theoretical memorization limits with observed empirical behaviors.
Researcher Affiliation | Academia | Haiyu Wang, Department of Statistics and Data Science, The Chinese University of Hong Kong, Hai Yu EMAIL; Yuanyuan Lin, Department of Statistics and Data Science, The Chinese University of Hong Kong, EMAIL
Pseudocode | No | The paper defines mathematical formulations of Transformer components like self-attention and feed-forward layers, and provides formal definitions for concepts like autoregressive generation and prompt tuning. However, it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The code will be publicly available upon acceptance.
Open Datasets | Yes | The data points to be memorized are randomly sampled from the IMDb [Maas et al., 2011] dataset. ... We randomly sample 1000 samples from SST-2 dataset [Socher et al., 2013], which are truncated to a length of 8.
Dataset Splits | No | For dataset sizes of 1600, 2500, and 3600. ... The training dataset size is 2000 and test dataset size is 200. While specific sizes are mentioned, the paper does not provide explicit percentages or methodologies for how the training, validation, and test splits were created for all experiments, nor does it cite standard splits consistently for all datasets used.
Hardware Specification | Yes | All the experiments are conducted on one NVIDIA T4 GPU.
Software Dependencies | No | Our code is based on standard PyTorch modules. We use the RoBERTa-base (12 heads and 12 layers) implementation of Hugging Face [Wolf et al., 2019]. The paper mentions PyTorch and Hugging Face, but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Number of training epochs is 1000, learning rate is 0.005. Optimizer is AdamW [Loshchilov and Hutter, 2017]. ... Number of training epochs is 100, learning rate is 0.001. Optimizer is AdamW. We use a two-layer randomly initialized Transformer with an embedding size of 512... The input sequence length is 16 where the first 8 tokens are prompt tokens and the remaining 8 are data tokens.
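
Since the authors' code is not yet released, the following is a minimal PyTorch sketch of the kind of setup the Experiment Setup entry describes: a frozen, randomly initialized two-layer Transformer with embedding size 512, where 8 trainable prompt vectors are prepended to 8 data tokens and only the prompt is optimized with AdamW. It is an illustration under stated assumptions, not the paper's implementation. The class name `PromptTunedTransformer`, the helpers `make_optimizer` and `train_step`, the vocabulary size, the number of attention heads, the per-position cross-entropy objective, and the use of `nn.TransformerEncoder` without a causal mask are all hypothetical choices. The entry lists two learning rates (0.005 and 0.001) without saying which experiment uses which, so the value below is picked arbitrarily from the two.

```python
import torch
import torch.nn as nn

# Values taken from the Experiment Setup entry.
EMBED_DIM = 512
NUM_LAYERS = 2
PROMPT_LEN = 8   # first 8 positions are prompt tokens
DATA_LEN = 8     # remaining 8 positions are data tokens
LEARNING_RATE = 0.005  # one of the two listed rates; pairing is an assumption

# Assumptions for illustration only; not stated in the source.
VOCAB_SIZE = 1000
NUM_HEADS = 8


class PromptTunedTransformer(nn.Module):
    """Frozen random Transformer with a trainable continuous prompt (hypothetical)."""

    def __init__(self) -> None:
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        layer = nn.TransformerEncoderLayer(
            d_model=EMBED_DIM, nhead=NUM_HEADS, dropout=0.0, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=NUM_LAYERS)
        self.head = nn.Linear(EMBED_DIM, VOCAB_SIZE)
        # Freeze every parameter registered so far (embedding, backbone, head).
        for param in self.parameters():
            param.requires_grad_(False)
        # The prompt is registered after freezing, so it is the only trainable tensor.
        self.prompt = nn.Parameter(0.02 * torch.randn(PROMPT_LEN, EMBED_DIM))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        data = self.embed(tokens)                                # (B, DATA_LEN, D)
        prompt = self.prompt.unsqueeze(0).expand(data.size(0), -1, -1)
        hidden = self.backbone(torch.cat([prompt, data], dim=1))  # (B, 16, D)
        return self.head(hidden[:, PROMPT_LEN:])                 # logits at data positions


def make_optimizer(model: PromptTunedTransformer) -> torch.optim.Optimizer:
    # Only the prompt is handed to AdamW; the backbone stays at its random init.
    return torch.optim.AdamW([model.prompt], lr=LEARNING_RATE)


def train_step(model, optimizer, tokens, targets) -> float:
    """One gradient step on a per-position cross-entropy loss (assumed objective)."""
    logits = model(tokens)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Registering the prompt after the freezing loop keeps the trainable set explicit: filtering `model.named_parameters()` by `requires_grad` should yield only `prompt`. The autoregressive generation studied in the paper is not modeled in this sketch.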