Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Understanding Prompt Tuning and In-Context Learning via Meta-Learning

Authors: Tim Genewein, Kevin Li, Jordi Grau-Moya, Anian Ruoss, Laurent Orseau, Marcus Hutter

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, which are sequences of real-valued vectors outside the token alphabet, can lead to very effective prompts for trained and even untrained networks by manipulating activations in ways that are not achievable by hard tokens.
Researcher Affiliation Industry Tim Genewein, Li Kevin Wenliang, Jordi Grau-Moya, Anian Ruoss, Laurent Orseau, Marcus Hutter; Google DeepMind. Correspondence to EMAIL.
Pseudocode No The paper describes algorithms and methods conceptually and mathematically, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code Yes The code to reproduce all our experiments is available at: https://github.com/google-deepmind/thunnini. Our main contributions are: We discuss how prompting can be understood as steering a Bayesian sequential predictor via its pretrained in-context adaptation mechanism that arises from meta-training, see Section 2.
Open Datasets No We use coin-flip sequences P(x_{1:N} | τ) = Bernoulli(τ) with three different distributions P(τ) throughout our experiments. Random coins: P(τ) = Beta(1, 1), leading to a uniform distribution over coin biases. This is our pretraining distribution. The exact Bayesian predictor in this case is the Laplace predictor. Single coin: A single coin with bias 0.2. This target distribution fulfills the condition that makes optimal prompting possible in Eq. (8). Two-coin mixture: A mixture of two coins, one with bias 0.2 and one with bias 0.8, with equal mixing weights of 1/2 each. This target distribution violates the condition that theoretically allows for optimal prompting in Eq. (8).
Dataset Splits No We conduct a series of experiments where a neural network is first meta-trained over a pretraining distribution (Section 2) of coin-flip sequences of length N_pre = 100, and then prompt-tuned (Eq. (7)) or weight-tuned (mini-batch based log-loss minimization) to a target distribution of coin-flip sequences of length N_tune = 50. After tuning, we evaluate tuned models on 2048 sequences of length N_eval = 200 from the target distribution.
Hardware Specification Yes The educational experiments presented in the paper were run on a single V100 GPU in under 6 hours.
Software Dependencies No The paper does not explicitly list specific software components with their version numbers, such as Python, PyTorch, or other libraries.
Experiment Setup Yes Training and tuning details. We pretrain for 1000 gradient steps (batch size 256, sequence length N_pre = 100, learning rate 0.001, and gradient clipping if the norm is ≥ 1). For tuning we use 1000 steps (batch size of 256, thus K = 256,000, sequence length N_tune = 50, learning rate of 5e-3, and gradient clipping if the norm is ≥ 1). Neural sequential predictors. We evaluate both LSTMs and decoder-only Transformers. To support all fine-tuning methods we always use an initial embedding and a final unembedding layer. The embedding is a trainable linear projection from the 2D token space into a 128-dimensional embedding space (results for 4-dimensional embeddings are shown in Appendix I). The unembedding is a trainable linear projection from the outputs of the final network layer down to the 2D logits. Implementation details. The LSTM has a single hidden layer of width 128; the Transformer has a single multi-head attention layer with output dimensionality of 128, 4 attention heads, causal masking, SinCos positional encoding, a widening factor of 4 for the MLP block, and layer normalization after query and key dense layers.
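
The coin-flip setup quoted in the rows above is small enough to sketch in a few lines. The following is a minimal illustration only, not the code from the linked repository: it samples the three coin distributions, implements the Laplace predictor (the exact Bayesian predictor under the Beta(1, 1) prior), and scores a predictor by cumulative log loss on 2048 target sequences of length 200. All function and constant names are hypothetical, and reading K as the number of sequences seen during tuning (steps times batch size) is an assumption.

```python
import math
import random

# Sizes quoted in the summary above.
N_TUNE, N_EVAL = 50, 200
NUM_EVAL_SEQUENCES = 2048
TUNE_STEPS, BATCH_SIZE = 1000, 256
# Assumption: K counts the sequences seen during tuning (steps x batch size).
K = TUNE_STEPS * BATCH_SIZE


def sample_bias(kind, rng):
    """Draw a coin bias tau from one of the three distributions P(tau)."""
    if kind == "random_coins":      # pretraining: tau ~ Beta(1, 1), i.e. uniform
        return rng.betavariate(1.0, 1.0)
    if kind == "single_coin":       # target: one coin with bias 0.2
        return 0.2
    if kind == "two_coin_mixture":  # target: bias 0.2 or 0.8, each with weight 1/2
        return 0.2 if rng.random() < 0.5 else 0.8
    raise ValueError(f"unknown distribution: {kind}")


def sample_sequence(kind, length, rng):
    """Sample one bias, then `length` i.i.d. Bernoulli(tau) coin flips."""
    tau = sample_bias(kind, rng)
    return [1 if rng.random() < tau else 0 for _ in range(length)]


def laplace_predict(ones, total):
    """P(next flip = 1) after seeing `ones` heads in `total` flips."""
    return (ones + 1) / (total + 2)


def cumulative_log_loss(sequence, predict_one):
    """Sum of -log p(x_t | x_<t) in nats for a count-based predictor."""
    loss, ones = 0.0, 0
    for t, x in enumerate(sequence):
        p_one = predict_one(ones, t)
        loss -= math.log(p_one if x == 1 else 1.0 - p_one)
        ones += x
    return loss


def mean_eval_loss(kind, predict_one, seed=0):
    """Average cumulative log loss over the evaluation set for `kind`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(NUM_EVAL_SEQUENCES):
        seq = sample_sequence(kind, N_EVAL, rng)
        total += cumulative_log_loss(seq, predict_one)
    return total / NUM_EVAL_SEQUENCES
```

As a usage example, `mean_eval_loss("single_coin", laplace_predict)` gives the loss of the untuned Bayesian predictor on the single-coin target, which can be compared against an oracle such as `lambda ones, total: 0.2` that already knows the bias. The gap between the two is the kind of quantity that tuning aims to close.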