Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Are Large Language Models Sensitive to the Motives Behind Communication?

Authors: Addison J. Wu, Ryan Liu, Kerem Oktar, Ted Sumers, Tom Griffiths

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we undertake a comprehensive study of whether LLMs have this capacity for motivational vigilance. We first employ controlled experiments from cognitive science to verify that LLMs behavior is consistent with rational models of learning from motivated testimony... Experiment 1: Can LLMs discriminate... Experiment 2: Can LLMs exercise nuanced vigilance... Experiment 3: Do LLMs generalize vigilance...
Researcher Affiliation	Collaboration	Addison J. Wu¹, Ryan Liu¹, Kerem Oktar², Theodore R. Sumers³, Thomas L. Griffiths¹,²; ¹Department of Computer Science, Princeton University; ²Department of Psychology, Princeton University; ³Anthropic
Pseudocode	No	The paper describes methodologies through textual descriptions, mathematical models, and experimental setups (e.g., Figure 1 and Appendix A.1 for prompts), but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our code will be provided in the supplemental material.
Open Datasets	Yes	We first obtain a comprehensive dataset of existing sponsorships on the video-hosting website YouTube from SponsorBlock [71].
Dataset Splits	No	The paper describes data used as experimental stimuli (e.g., "300 randomly selected video IDs", "20 easy images", "20 hard images") and analysis splits (e.g., "shortest 25% (Q1) and longest 25% (Q4) of transcripts"), but does not specify training/test/validation dataset splits for models developed or trained within the paper itself.
Hardware Specification	No	Our experiments were run strictly by API and not locally, thus we did not need local compute resources.
Software Dependencies	No	The paper primarily evaluates existing LLM APIs (GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, Llama 3.3-70B). It does not list any specific software libraries or frameworks with version numbers for the authors' own implementation or analysis.
Experiment Setup	Yes	For each model, payoff structure, and prompting method, we conduct n = 30 trials over the same 20 pairs of images (order shuffled every trial), with temperature = 1. ... All models are sampled at temperature 1. ... queried each video and prompting combination n = 1 time with temperature = 0 to minimize variability.
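
The first setup quoted in the Experiment Setup row (n = 30 trials over the same 20 image pairs, with the order reshuffled every trial and temperature = 1) can be pictured as a trial schedule. The sketch below is only an illustration under stated assumptions and is not the authors' code: the function name `build_trial_schedule`, the seed, and the `pair_XX` identifiers are hypothetical, and no model API is called.

```python
import random

TEMPERATURE = 1.0  # sampling temperature quoted in the setup row


def build_trial_schedule(pairs, n_trials=30, seed=0):
    """Return one independently shuffled ordering of `pairs` per trial.

    `seed` is an assumption added here so the schedule is repeatable;
    the source does not say whether a seed was fixed.
    """
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_trials):
        order = list(pairs)  # copy so the caller's list is left untouched
        rng.shuffle(order)   # new presentation order for this trial
        schedule.append(order)
    return schedule


# Hypothetical placeholders standing in for the 20 image pairs.
pairs = [f"pair_{i:02d}" for i in range(20)]
schedule = build_trial_schedule(pairs)
```

Each entry of `schedule` holds the same 20 pairs in a different order, so every trial sees identical stimuli and only the presentation order varies.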