Evaluating and Inducing Personality in Pre-trained Language Models

Authors: Guangyuan Jiang, Manjie Xu, Song-Chun Zhu, Wenjuan Han, Chi Zhang, Yixin Zhu

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "By systematically evaluating LLMs with MPI, we provide the first piece of evidence demonstrating the efficacy of MPI in studying LLMs' behaviors. We further devise a PERSONALITY PROMPTING (P²) method to induce LLMs with specific personalities in a controllable way, capable of producing diverse and verifiable behaviors." |
| Researcher Affiliation | Academia | Institute for Artificial Intelligence, Peking University; Yuanpei College, Peking University; National Key Laboratory of General Artificial Intelligence, BIGAI; Beijing Jiaotong University |
| Pseudocode | No | The paper describes its methods in prose and figures but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for its methodology. |
| Open Datasets | Yes | "We build MPI's items upon International Personality Item Pool (IPIP) with its IPIP-NEO derivations (Goldberg et al., 1999, 2006; Johnson, 2005, 2014) in the public domain and Lang et al. (2011)'s BFI-S." |
| Dataset Splits | No | The paper does not provide the dataset split information (exact percentages, sample counts, or splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | Yes | "All LLMs are either from Hugging Face Transformers (Wolf et al., 2020) or Eleuther AI's releases (Black et al., 2022), running on either eight NVIDIA A100 80GB or two RTX 3090 GPUs." |
| Software Dependencies | No | The paper mentions software such as Hugging Face Transformers, Eleuther AI's releases, and the OpenAI API, but does not specify version numbers for these or other dependencies. |
| Experiment Setup | Yes | "We use temperature 0 for the autoregressive model's text token prediction. Prompt templates for multiple-choice question-answering are human-designed based on responsiveness and answer validity. Tab. 1 shows an example prompt used for GPT-3.5." |
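
The Experiment Setup row quotes temperature-0 decoding with a human-designed multiple-choice template (Tab. 1 of the paper). Below is a minimal sketch of what evaluating one MPI item could look like; the template wording, the example item, the `gpt2` checkpoint, and the scoring map are illustrative assumptions, not the authors' released code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative multiple-choice template in the spirit of the paper's Tab. 1;
# the exact wording used for GPT-3.5 may differ.
TEMPLATE = (
    'Question: Given a statement of you: "You {item}."\n'
    "Please choose from the following options to identify how accurately "
    "this statement describes you.\n"
    "Options:\n"
    "(A). Very Accurate\n"
    "(B). Moderately Accurate\n"
    "(C). Neither Accurate Nor Inaccurate\n"
    "(D). Moderately Inaccurate\n"
    "(E). Very Inaccurate\n"
    "Answer:"
)

# Likert scores for a positively keyed item; negatively keyed items reverse
# the scale, as is standard in Big Five questionnaires.
SCORES = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}

# Hypothetical small checkpoint for illustration; the paper evaluates larger
# Hugging Face / EleutherAI models and GPT-3.5.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer(prompt: str) -> str:
    """Greedy decoding (do_sample=False), the equivalent of temperature 0."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=4, do_sample=False)
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

def score_item(item: str) -> int:
    """Score one IPIP-style item stem, e.g. 'worry about things'."""
    text = answer(TEMPLATE.format(item=item)).strip().lstrip("(")
    return SCORES.get(text[:1].upper(), 3)  # neutral fallback if unparsable

print(score_item("worry about things"))
```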
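
The Research Type row also mentions PERSONALITY PROMPTING (P²), which chains a target trait into a model-generated self-description before the actual question is posed. The sketch below illustrates such a chain under stated assumptions: the keyword lists and template wording are placeholders, and `generate` stands for any temperature-0 completion callable such as `answer` above; this is not the paper's exact pipeline.

```python
from typing import Callable

# Illustrative trait keywords; the paper derives its keyword lists from
# the psychological literature on the Big Five.
TRAIT_KEYWORDS = {
    "extraversion": ["friendly", "outgoing", "talkative", "energetic"],
    "neuroticism": ["anxious", "tense", "moody", "self-conscious"],
}

def build_p2_prompt(
    generate: Callable[[str], str], trait: str, question: str
) -> str:
    """Chain: trait -> keywords -> self-description -> question context."""
    keywords = ", ".join(TRAIT_KEYWORDS[trait])
    # Step 1: let the model expand the keywords into a short first-person
    # description of itself (the instruction wording is an assumption).
    description = generate(
        f"Describe in the first person someone who is {keywords}:"
    )
    # Step 2: prepend the self-generated description as persona context
    # for the downstream MPI item or other question.
    return f"{description.strip()}\n\n{question}"
```

The induced personality can then be checked by re-running the MPI assessment with the constructed persona context prepended, which is how controllable and verifiable induction would be demonstrated.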