EAI: Emotional Decision-Making of LLMs in Strategic Games and Ethical Dilemmas

Authors: Mikhail Mozikov, Nikita Severin, Valeria Bodishtianu, Maria Glushanina, Ivan Nasonov, Daniil Orekhov, Pekhotin Vladislav, Ivan Makovetskiy, Mikhail Baklashkin, Vasily Lavrentyev, Akim Tsvigun, Denis Turdakov, Tatiana Shavrina, Andrey Savchenko, Ilya Makarov

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experimental study with various LLMs demonstrated that emotions can significantly alter the ethical decision-making landscape of LLMs, highlighting the need for robust mechanisms to ensure consistent ethical standards.
Researcher Affiliation Collaboration Mikhail Mozikov 1,2, Nikita Severin 3,4, Valeria Bodishtianu5, Maria Glushanina6, Ivan Nasonov7, Daniil Orekhov3, Vladislav Pekhotin2, Ivan Makovetskiy2, Mikhail Baklashkin7, Vasily Lavrentyev8, Akim Tsvigun9, Denis Turdakov4, Tatiana Shavrina10, Andrey Savchenko3,4,11, Ilya Makarov 1,2,3,4,8,12 1AIRI, 2NUST MISi S, 3HSE University, 4ISP RAS, 5Cornell University, 6École normale supérieure, 7Independent Researcher, 8ITMO University, 9KU Leuven, 10Institute of Linguistics RAS, 11Sber AI Lab, 12MIPT
Pseudocode No The paper does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code Yes In this paper, we propose a novel framework for emotion modeling in LLMs, with source code publicly available on Git Hub1. 1https://github.com/AIRI-Institute/EAI-Framework
Open Datasets Yes Implicit Ethics: Using the ETHICS dataset [63], we use LLM to categorize morally charged scenarios as wrong or not wrong. Explicit Ethics: Employing the Moral Choice dataset [36] with scenarios featuring two choices... Stereotype Recognition: Utilizing Stereo Set [35] to recognize stereotypes in sentences classifying them into one of three classes...
Dataset Splits No The paper evaluates pre-trained LLMs on existing datasets (ETHICS, Moral Choice, Stereo Set), but it does not specify any training/validation/test splits that *they* used for their experimental setup. It mentions accuracy is computed on 'all examples' or specific subsets (e.g., 'good' and 'bad' scenarios) for evaluation purposes, but not for splitting during model training or validation in their framework.
Hardware Specification No The paper states in its NeurIPS Paper Checklist that 'Experiments are mostly conducted via API and thus do not require extensive computer resources,' and does not provide specific details on CPU, GPU, or memory used for the experiments.
Software Dependencies No The paper specifies the versions of the LLMs used (e.g., 'gpt-3.5-turbo-0125 for GPT-3.5'), but it does not list general software dependencies or programming libraries with their version numbers (e.g., Python, PyTorch, CUDA) that are part of their framework's implementation.
Experiment Setup Yes We fixed model versions (Appendix B.1) for reproducibility and set the temperature to 0. Additionally, we studied result consistency over five runs and temperature influence in Appendix C. ... In our experiments, we test reasoning with and without Co T. ... We introduce the budget effect and check whether or not varying the total endowment for allocation changes the behavior of LLM both in the baseline configuration and in emotional states. ... For the experiments involving the Responder in the Ultimatum Game, we predefined different offers to verify the alignment of the acceptance rates: [0.2, 0.4, 0.6, 0.8, 0.95, 1].