Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Codifying Character Logic in Role-Playing

Authors: Letian Peng, Jingbo Shang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments demonstrate the significant benefits of codified profiles in improving persistence, updatability, and behavioral diversity. Notably, by offloading a significant portion of reasoning to preprocessing, codified profiles enable even 1B-parameter models to perform high-quality role-playing, providing an efficient, lightweight foundation for local deployment of role-play agents. ... To evaluate the effectiveness of codified profiles, we construct a new Fandom Benchmark comprising 83 characters and 5,141 scenes curated from Fandom, spanning mangas, novels, and television series. ... Our experiments start with a comparison between our codified profiles and traditional textual profiles, and then dive deep into the effect of evolving profiles that update sequentially along the episode timeline, the influence of profile-driven randomness, and the model's Best@K performance under stochastic response settings.
Researcher Affiliation	Academia	Letian Peng, Jingbo Shang Department of Computer Science University of California, San Diego EMAIL
Pseudocode	Yes	Converted by large language model (LLM) from textual profiles, each codified profile defines a set of functions parse_by_scene(scene) that output multiple logic-grounded assertions according to scene, using both explicit control structures (e.g., if-then-else) and flexible check_condition(scene, question) functions where each question is a semantically meaningful prompt about the scene (e.g., "Is the character in danger?") discriminated by the roleplaying LLM as true, false, or unknown. ... In practice, we utilize LLM's coding ability to codify each profile segment $p_i$ into an executable function $f_i : s \to p^{\text{trig}}_i$ (Code implementation: parse_by_scene(scene) → triggered_statements), which returns the possible reactions $p^{\text{trig}}_i$ in the scene $s$ based on the logic written in $p_i$.
Open Source Code	Yes	Codes and datasets are available at https://github.com/KomeijiForce/Codified_Profile_Koishiday_2025
Open Datasets	Yes	We introduce a new benchmark, constructed from Fandom-sourced scenes and characters, for evaluating role-playing consistency. We plan to open-source this dataset to facilitate future research. ... Codes and datasets are available at https://github.com/KomeijiForce/Codified_Profile_Koishiday_2025
Dataset Splits	Yes	Instead of querying the role-playing model for every condition, we distill from gpt-4.1's condition-checking outputs using 415 scenes (5 per character, 8% of all scenes) and obtain 20,759 labeled discrimination cases. A 3-class deberta-v3-base model (0.1B) [He et al., 2021] is trained on 90% of the data for 5 epochs and achieves 70.53% consistency with gpt-4.1 on the held-out 10%.
Hardware Specification	Yes	We calculate efficiency on a single 80G A100 with batch size 1.
Software Dependencies	Yes	Then, we employ llama-3.1-8b-instruct as the role-playing LLM. ... We compare vanilla prompting, textual profiles with chain-of-thought, and codified profiles across LLaMA-3 models of 1B (3.2), 3B (3.2), and 8B (3.1) parameters in Figure 7 ... A 3-class deberta-v3-base model (0.1B) [He et al., 2021] is trained on 90% of the data for 5 epochs and achieves 70.53% consistency with gpt-4.1 on the held-out 10%.
Experiment Setup	Yes	In Stochastic Response, the role-playing LLM generates K responses per scene using a fixed codified profile with probabilistic constructs (e.g., random.choice) and temperature < 0.7. ... A 3-class deberta-v3-base model (0.1B) [He et al., 2021] is trained on 90% of the data for 5 epochs and achieves 70.53% consistency with gpt-4.1 on the held-out 10%.
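
The Pseudocode and Experiment Setup rows above describe codified profiles as functions parse_by_scene(scene) that combine explicit control flow, three-valued check_condition(scene, question) calls, and probabilistic constructs such as random.choice. The following is a minimal illustrative sketch of that idea, not the authors' implementation. The character behavior, the reaction strings, and keyword_checker (a toy stand-in for the LLM or distilled classifier that answers each question) are all hypothetical, and passing the checker and random generator as arguments is an assumption made here so the example runs on its own.

```python
import random
from typing import Callable, List, Optional

# Three-valued answer: True, False, or None for "unknown".
Answer = Optional[bool]
Checker = Callable[[str, str], Answer]


def parse_by_scene(scene: str, check_condition: Checker,
                   rng: random.Random) -> List[str]:
    """Hypothetical codified profile for an invented, cautious character."""
    triggered_statements: List[str] = []
    in_danger = check_condition(scene, "Is the character in danger?")
    if in_danger is True:
        # Profile-driven randomness: sample one of several plausible reactions.
        triggered_statements.append(
            rng.choice(["The character retreats.",
                        "The character calls for help."])
        )
    elif in_danger is None:
        # The checker could not decide, so fall back to a neutral assertion.
        triggered_statements.append("The character stays alert.")
    else:
        triggered_statements.append("The character behaves calmly.")
    return triggered_statements


def keyword_checker(scene: str, question: str) -> Answer:
    """Toy stand-in for the discriminator; a real system would query a model.

    It ignores the question and only looks for keywords in the scene.
    """
    text = scene.lower()
    if "attack" in text:
        return True
    if "quiet" in text:
        return False
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    print(parse_by_scene("A quiet afternoon at the shrine.", keyword_checker, rng))
    print(parse_by_scene("Bandits attack the village.", keyword_checker, rng))
```

In this sketch the returned statements would be handed to the role-playing model as grounding for its response. Calling the function K times with different random states gives K candidate behaviors for the same scene, which is one way to picture the stochastic response setting.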