Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Concept Incongruence: An Exploration of Time and Death in Role Playing

Authors: Xiaoyan Bai, Ike Peng, Aditya Singh, Chenhao Tan

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We run experiments under both ROLE-PLAY and NON-ROLE-PLAY settings, using our three behavioral metrics to evaluate model behavior. ... we apply the three behavioral metrics to role playing interactions with open-sourced models (Llama-3.1-8B-Instructed [39], Gemma-2-9b-Instructed [38]), and large frontier commercial models (GPT-4.1-nano and Claude-3.7-Sonnet). We conduct the experiments on one A40 GPU.
Researcher Affiliation | Academia | Xiaoyan Bai Ike Peng* Aditya Singh* Chenhao Tan University of Chicago EMAIL
Pseudocode | No | The paper describes methods and procedures in paragraph text, for example, under '2 Experiment Setup' and '4 Understanding Model Behavior under Concept Incongruence', but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at: https://github.com/ChicagoHAI/concept-incongruence.
Open Datasets | Yes | Role-play dataset. We collect a total of 100 real historical figures, all of whom died between 1890 and 1993. ... More details are in Appendix A. ... For generalized temporal-question probes, we apply the same procedure to the entertainment dataset proposed before [12], which contains 31,321 items (24,884 train, 6,437 test).
Dataset Splits | Yes | For the layer-wise dead / alive probe, we compile 1,000 dead and 1,000 alive individuals. We split 80%/20% for training and testing. ... We train temporal-representation probes on 277 U.S. president questions with an 80%/20% train test split. ... For generalized temporal-question probes, we apply the same procedure to the entertainment dataset proposed before [12], which contains 31,321 items (24,884 train, 6,437 test).
Hardware Specification | Yes | We conduct the experiments on one A40 GPU.
Software Dependencies | No | The paper mentions specific open-sourced and commercial models used (Llama-3.1-8B-Instructed, Gemma-2-9b-Instructed, GPT-4.1-nano, Claude-3.7-Sonnet, GPT-4o-mini as judge) but does not provide specific version numbers for any programming languages, libraries, or other ancillary software dependencies.
Experiment Setup | Yes | Table 4: Hyperparameters for probe training (Dead/Alive, Death Year): learning rate 0.001, 0.001; batch size 100, 100; epochs 500, 500.
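
To make the reported split and probe settings concrete, the following is a minimal Python sketch, not the authors' code. The hyperparameter values and the 80%/20% split of 1,000 dead and 1,000 alive individuals come from the rows above; the names PROBE_HPARAMS and train_test_split, the placeholder records, the random seed, and the choice of a simple shuffled (non-stratified) split are all assumptions made for illustration.

```python
import random

# Values copied from the Experiment Setup row (Table 4).
# The dictionary name and layout are illustrative, not from the paper.
PROBE_HPARAMS = {
    "dead_alive": {"learning_rate": 0.001, "batch_size": 100, "epochs": 500},
    "death_year": {"learning_rate": 0.001, "batch_size": 100, "epochs": 500},
}


def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists.

    The seed and the plain shuffle are assumptions; the summary above does
    not say how the authors randomized or stratified their split.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder records standing in for 1,000 dead and 1,000 alive individuals.
people = [
    (f"person_{i}", label) for label in ("dead", "alive") for i in range(1000)
]

train, test = train_test_split(people)

# Ceiling division: number of mini-batches needed to cover the training set.
batch_size = PROBE_HPARAMS["dead_alive"]["batch_size"]
batches_per_epoch = -(-len(train) // batch_size)

print(len(train), len(test), batches_per_epoch)  # prints: 1600 400 16
```

Under these assumptions the dead/alive probe would see 1,600 training and 400 test examples, which at batch size 100 is 16 batches per epoch over 500 epochs.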