Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Do different prompting methods yield a common task representation in language models?

Authors: Guy Davidson, Todd Gureckis, Brenden M Lake, Adina Williams

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We study this through function vectors (FVs), recently proposed as a mechanism to extract few-shot ICL task representations. We generalize FVs to alternative task presentations, focusing on short textual instruction prompts, and successfully extract instruction function vectors that promote zero-shot task accuracy. We find evidence that demonstration- and instruction-based function vectors leverage different model components, and offer several controls to dissociate their contributions to task performance. Our results suggest that different task prompting forms do not induce a common task representation through FVs but elicit different, partly overlapping mechanisms.
Researcher Affiliation	Collaboration	Guy Davidson¹, Todd M. Gureckis², Brenden M. Lake³, Adina Williams¹; ¹FAIR at Meta, ²New York University, ³Princeton University
Pseudocode	No	The paper describes a procedure in text and a diagram in Figure 1, but does not contain a structured pseudocode block or algorithm section explicitly labeled as such.
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide a zip archive of our code with instructions for how to set up the environment and example commands to launch our experiments.
Open Datasets	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: ... all the data we use is available as we reuse the datasets provided by Todd et al. (2024). ... All tasks used were sourced from Todd et al.'s (2024) repository: https://github.com/ericwtodd/function_vectors.
Dataset Splits	Yes	We split each dataset 70% to train and 30% to test. Where we require a validation set, we split it again from the training set.
Hardware Specification	Yes	We run all of our experiments on Volta and Pascal-series GPUs, with a single GPU sufficing for every experiment we launch.
Software Dependencies	No	The paper mentions "Huggingface Transformers (Wolf et al., 2019) model implementations" but does not specify a version number for the library. While model citations like "Llama Team (2024)" are provided, these refer to the models themselves, not the software dependencies with specific version numbers.
Experiment Setup	Yes	For each model and each task, we use the J = 5 instructions with the highest accuracy over the training split. We compute the mean activations over 100 total prompts, 20 with each of the 5 best instructions. We compute the causal indirect effects over 25 total uninformative prompts, 5 generated for each of the best instructions. We batch our results with a batch size that depends on the model and task, but does not exceed 5 for any model or task. ... We load all models in full precision. We use the |A| = 20 top heads in all experiments we report. We evaluate the FV interventions at every possible depth (that is, after every layer of the model). In all main manuscript figures, we report the accuracy intervening after the |L/3| layer.
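
The data handling quoted in the Dataset Splits and Experiment Setup rows (a 70%/30% train/test split, a validation set carved out of the training set when needed, and the J = 5 instructions with the highest training accuracy) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function names, the random seed, and the validation fraction are hypothetical, since the quoted text does not state them.

```python
import random


def split_dataset(examples, train_frac=0.7, val_frac=None, seed=0):
    """Shuffle and split into train/test; optionally carve validation from train.

    train_frac=0.7 follows the quoted 70%/30% split. val_frac and seed are
    assumptions: the quoted text gives neither a validation size nor a seed.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(round(train_frac * len(shuffled)))
    train, test = shuffled[:n_train], shuffled[n_train:]
    val = []
    if val_frac is not None:
        # The validation set is taken from the training set, never from test.
        n_val = int(round(val_frac * len(train)))
        val, train = train[:n_val], train[n_val:]
    return train, val, test


def top_instructions(accuracy_by_instruction, j=5):
    """Return the j instructions with the highest training-split accuracy."""
    ranked = sorted(
        accuracy_by_instruction,
        key=accuracy_by_instruction.get,
        reverse=True,
    )
    return ranked[:j]


if __name__ == "__main__":
    train, val, test = split_dataset(range(100), val_frac=0.2)
    print(len(train), len(val), len(test))
    # Hypothetical accuracies, for illustration only.
    print(top_instructions({"a": 0.1, "b": 0.9, "c": 0.5}, j=2))
```

Fixing the seed makes the split repeatable across runs, which is the kind of detail a reader reproducing the experiments would look for in the released code archive.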