Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Are Large Language Models Sensitive to the Motives Behind Communication?

Authors: Addison J. Wu, Ryan Liu, Kerem Oktar, Ted Sumers, Tom Griffiths

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In this paper, we undertake a comprehensive study of whether LLMs have this capacity for motivational vigilance. We first employ controlled experiments from cognitive science to verify that LLMs' behavior is consistent with rational models of learning from motivated testimony... Experiment 1: Can LLMs discriminate... Experiment 2: Can LLMs exercise nuanced vigilance... Experiment 3: Do LLMs generalize vigilance... |
| Researcher Affiliation | Collaboration | Addison J. Wu¹, Ryan Liu¹, Kerem Oktar², Theodore R. Sumers³, Thomas L. Griffiths¹,²; ¹Department of Computer Science, Princeton University; ²Department of Psychology, Princeton University; ³Anthropic |
| Pseudocode | No | The paper describes methodologies through textual descriptions, mathematical models, and experimental setups (e.g., Figure 1 and Appendix A.1 for prompts), but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code will be provided in the supplemental material. |
| Open Datasets | Yes | We first obtain a comprehensive dataset of existing sponsorships on the video-hosting website YouTube from SponsorBlock [71]. |
| Dataset Splits | No | The paper describes data used as experimental stimuli (e.g., "300 randomly selected video IDs", "20 easy images", "20 hard images") and analysis splits (e.g., "shortest 25% (Q1) and longest 25% (Q4) of transcripts"), but does not specify training/test/validation dataset splits for models developed or trained within the paper itself. |
| Hardware Specification | No | Our experiments were run strictly by API and not locally, thus we did not need local compute resources. |
| Software Dependencies | No | The paper primarily evaluates existing LLM APIs (GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, Llama 3.3-70B). It does not list any specific software libraries or frameworks with version numbers for the authors' own implementation or analysis. |
| Experiment Setup | Yes | For each model, payoff structure, and prompting method, we conduct n = 30 trials over the same 20 pairs of images (order shuffled every trial), with temperature = 1. ... All models are sampled at temperature 1. ... queried each video and prompting combination n = 1 time with temperature = 0 to minimize variability. |
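
The Experiment Setup entry describes a fully crossed design: every combination of model, payoff structure, and prompting method gets 30 trials over the same 20 image pairs, with the pair order reshuffled on each trial and sampling at temperature 1. The sketch below shows one way such a trial schedule could be enumerated. It is an illustration only, not the authors' code: the payoff and prompting labels, the model identifier strings, the `build_schedule` function, and the fixed seed are all hypothetical, and the source does not say whether all four models were used in this particular experiment. No API calls are made.

```python
import itertools
import random

# Hypothetical identifiers; the summary names the models but not these strings.
MODELS = ["gpt-4o", "claude-3.5-sonnet", "gemini-2.0-flash", "llama-3.3-70b"]
# Placeholder condition labels; the summary does not name the actual conditions.
PAYOFF_STRUCTURES = ["payoff_a", "payoff_b"]
PROMPTING_METHODS = ["prompt_a", "prompt_b"]

N_TRIALS = 30          # from the summary: n = 30 trials per condition
N_IMAGE_PAIRS = 20     # from the summary: the same 20 pairs of images
TEMPERATURE = 1.0      # from the summary: temperature = 1


def build_schedule(seed=0):
    """Enumerate every (model, payoff, prompt, trial) cell with a shuffled pair order."""
    rng = random.Random(seed)  # fixed seed is an assumption, added for repeatability
    schedule = []
    conditions = itertools.product(MODELS, PAYOFF_STRUCTURES, PROMPTING_METHODS)
    for model, payoff, prompt in conditions:
        for trial in range(N_TRIALS):
            order = list(range(N_IMAGE_PAIRS))
            rng.shuffle(order)  # pair order is reshuffled on every trial
            schedule.append(
                {
                    "model": model,
                    "payoff": payoff,
                    "prompt": prompt,
                    "trial": trial,
                    "pair_order": order,
                    "temperature": TEMPERATURE,
                }
            )
    return schedule


schedule = build_schedule()
```

Each entry in `schedule` carries everything needed to issue one trial's queries, so the same list can be logged alongside the model outputs to record exactly which ordering each trial saw.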