In-Context Impersonation Reveals Large Language Models' Strengths and Biases

Authors: Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, Zeynep Akata

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In a multi-armed bandit task, we find that LLMs pretending to be children of different ages recover human-like developmental stages of exploration. In a language-based reasoning task, we find that LLMs impersonating domain experts perform better than LLMs impersonating non-domain experts. Finally, we test whether LLM impersonations are complementary to visual information when describing different categories. (A prompt sketch follows the table.)
Researcher Affiliation | Academia | 1 University of Tübingen, 2 Tübingen AI Center, 3 University of Porto, 4 INESC TEC, 5 Max Planck Institute for Biological Cybernetics
Pseudocode | No | The paper describes its methodology and tasks using natural language and mathematical equations, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/ExplainableML/in-context-impersonation.
Open Datasets | Yes | Our experiments on expertise-based impersonation (details in Section 3.3) are conducted on the MMLU dataset [67]... two state-of-the-art fine-grained visual categorization datasets, i.e. Caltech-UCSD Birds (CUB) [78] and Stanford Cars [79]. (See the CLIP scoring sketch after the table.)
Dataset Splits | No | The paper states 'All experiments are performed on the test splits' and discusses evaluation on these splits, but it does not explicitly provide details on training or validation dataset splits, percentages, or sample counts.
Hardware Specification | Yes | All experiments are performed on the test splits using a single A100-40GB GPU and we mention inference times in suppl. Section A.2.
Software Dependencies | No | The paper mentions software components like the 'Vicuna-13B language model', the 'OpenAI API of ChatGPT with the gpt-3.5-turbo model', and 'CLIP (or variants thereof)', but it does not provide specific version numbers for the software libraries or frameworks used.
Experiment Setup | Yes | When sampling full sentences, we use a temperature of 0.7; to obtain the answer as a single symbol (token), we set it to 1 unless otherwise stated. These different temperatures were chosen based on the recommended default values of each LLM... For Vicuna-13B we use the default temperature of 0.7 and the default top-k value of k = 50. For ChatGPT we use the default temperature of 1.0. (A decoding sketch follows the table.)
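
To make the impersonation protocol from the Research Type row concrete, here is a minimal sketch of how a persona-conditioned prompt for the two-armed bandit task could be constructed. The prompt wording, persona string, machine labels, and reward bookkeeping are illustrative assumptions, not the paper's verbatim prompts.

```python
# Hypothetical sketch of building an impersonation prompt for one two-armed
# bandit trial; wording and labels are assumptions, not the paper's prompts.

def bandit_prompt(persona: str, history: list[tuple[str, int]], trials_left: int) -> str:
    """Ask the impersonated LLM to choose one of two slot machines."""
    lines = [
        f"If you were a {persona}, how would you play the following game?",
        "You are choosing between two slot machines, F and J.",
        f"You have {trials_left} pulls left. Your previous pulls:",
    ]
    for arm, reward in history:
        lines.append(f"- Machine {arm} delivered {reward} points.")
    # The reply is read out as a single symbol (see the Experiment Setup row).
    lines.append("Q: Which machine do you choose, F or J? A: Machine")
    return "\n".join(lines)

print(bandit_prompt("4-year-old child", [("F", 52), ("J", 43)], trials_left=8))
```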
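For the vision-language experiments on CUB and Stanford Cars, persona-generated class descriptions are scored against images with CLIP. The sketch below shows one way such scoring could look using the Hugging Face transformers CLIP API; the checkpoint name, image path, and description texts are placeholders, and the paper may use a different CLIP variant.

```python
# Sketch of zero-shot classification from LLM-generated class descriptions
# with CLIP; model id, image path, and descriptions are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# One description per class, as an impersonated LLM might produce them.
descriptions = {
    "Black Tern": "a small dark seabird with slender pointed wings",
    "Cardinal": "a bright red songbird with a prominent crest",
}

image = Image.open("bird.jpg")  # hypothetical test image
inputs = processor(text=list(descriptions.values()), images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores
pred = list(descriptions)[logits.argmax(dim=-1).item()]
print(f"Predicted class: {pred}")
```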
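Finally, the quoted Experiment Setup maps onto two decoding modes: temperature 0.7 for full-sentence generation and temperature 1.0, capped at one generated token, for single-symbol answers. The sketch below applies both modes through the OpenAI chat API; the temperatures follow the paper's reported setup, while the prompts and the client usage pattern are assumptions.

```python
# Sketch of the two decoding modes quoted above; temperatures follow the
# paper's reported setup, prompts and client usage are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Full-sentence sampling (e.g., persona-based visual descriptions): temperature 0.7.
description = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "If you were an ornithologist, describe what a cardinal looks like."}],
    temperature=0.7,
)

# Single-symbol answers (e.g., an MMLU option letter): temperature 1.0, with
# generation capped at one token so the reply is a single symbol.
answer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "Answer with a single letter: A, B, C, or D. <question here>"}],
    temperature=1.0,
    max_tokens=1,
)
print(description.choices[0].message.content)
print(answer.choices[0].message.content)
```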