In-Context Impersonation Reveals Large Language Models' Strengths and Biases
Authors: Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, Zeynep Akata
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a multi-armed bandit task, we find that LLMs pretending to be children of different ages recover human-like developmental stages of exploration. In a language-based reasoning task, we find that LLMs impersonating domain experts perform better than LLMs impersonating non-domain experts. Finally, we test whether LLMs' impersonations are complementary to visual information when describing different categories. |
| Researcher Affiliation | Academia | ¹University of Tübingen, ²Tübingen AI Center, ³University of Porto, ⁴INESC TEC, ⁵Max Planck Institute for Biological Cybernetics |
| Pseudocode | No | The paper describes its methodology and tasks using natural language and mathematical equations, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/ExplainableML/in-context-impersonation. |
| Open Datasets | Yes | Our experiments on expertise-based impersonation (details in Section 3.3) are conducted on the MMLU dataset [67]... two state-of-the-art fine-grained visual categorization datasets, i.e. Caltech-UCSD Birds (CUB) [78] and Stanford Cars [79] (a loading sketch follows the table). |
| Dataset Splits | No | The paper states 'All experiments are performed on the test splits' and discusses evaluation on these splits, but it does not explicitly provide details on training or validation dataset splits, percentages, or sample counts. |
| Hardware Specification | Yes | All experiments are performed on the test splits using a single A100-40GB GPU and we mention inference times in suppl. Section A.2. |
| Software Dependencies | No | The paper mentions software components like 'Vicuna-13B language model', 'OpenAI API of ChatGPT with the gpt-3.5-turbo model', and 'CLIP (or variants thereof)', but it does not provide specific version numbers for the software libraries or frameworks used. |
| Experiment Setup | Yes | When sampling full sentences, we use a temperature of 0.7; to obtain the answer as a single symbol (token), we set it to 1 unless otherwise stated. These different temperatures were chosen based on the recommended default values of each LLM... For Vicuna-13B we use the default temperature of 0.7 and the default top-k value of k = 50. For ChatGPT we use the default temperature of 1.0. (A decoding sketch follows the table.) |
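All three benchmarks are publicly available. As a minimal sketch of how the test splits might be pulled in, assuming the Hugging Face hub identifier `cais/mmlu` and the `torchvision` wrapper for Stanford Cars (neither identifier appears in the paper, and CUB is typically downloaded manually from its project page):

```python
# Hedged sketch: the dataset identifiers below are assumptions, not from the paper.
from datasets import load_dataset
from torchvision.datasets import StanfordCars

# MMLU test split (the paper evaluates on test splits only); "cais/mmlu" is an assumed hub id.
mmlu_test = load_dataset("cais/mmlu", "all", split="test")

# Stanford Cars test split via the (assumed) torchvision wrapper; images must already be on disk.
cars_test = StanfordCars(root="data/", split="test", download=False)

print(len(mmlu_test), len(cars_test))
```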
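For concreteness, the decoding configuration in the last row can be reproduced roughly as follows. This is a minimal sketch, assuming the Hugging Face `transformers` API and the checkpoint name `lmsys/vicuna-13b-v1.3` (an assumption; the paper only says Vicuna-13B); the impersonation prompt is illustrative, and only the temperature and top-k values come from the paper.

```python
# Minimal sketch of the reported decoding settings; model name and prompt are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "lmsys/vicuna-13b-v1.3"  # assumed checkpoint id; the paper only names "Vicuna-13B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

prompt = "If you were a 4 year old girl, which of the two slot machines would you choose?"  # illustrative
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Full-sentence sampling: temperature 0.7 with top-k 50 (the Vicuna defaults per the paper).
sentence_ids = model.generate(
    **inputs, do_sample=True, temperature=0.7, top_k=50, max_new_tokens=64
)
print(tokenizer.decode(sentence_ids[0], skip_special_tokens=True))

# Single-symbol answers (e.g. a multiple-choice token): temperature 1.0, one new token.
answer_ids = model.generate(
    **inputs, do_sample=True, temperature=1.0, max_new_tokens=1
)
print(tokenizer.decode(answer_ids[0][-1:]))
```

ChatGPT (gpt-3.5-turbo) was queried through the OpenAI API instead, at its default temperature of 1.0.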