reproducibilityindex.ai

EAI: Emotional Decision-Making of LLMs in Strategic Games and Ethical Dilemmas

Authors: Mikhail Mozikov, Nikita Severin, Valeria Bodishtianu, Maria Glushanina, Ivan Nasonov, Daniil Orekhov, Pekhotin Vladislav, Ivan Makovetskiy, Mikhail Baklashkin, Vasily Lavrentyev, Akim Tsvigun, Denis Turdakov, Tatiana Shavrina, Andrey Savchenko, Ilya Makarov

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experimental study with various LLMs demonstrated that emotions can significantly alter the ethical decision-making landscape of LLMs, highlighting the need for robust mechanisms to ensure consistent ethical standards.
Researcher Affiliation	Collaboration	Mikhail Mozikov 1,2, Nikita Severin 3,4, Valeria Bodishtianu5, Maria Glushanina6, Ivan Nasonov7, Daniil Orekhov3, Vladislav Pekhotin2, Ivan Makovetskiy2, Mikhail Baklashkin7, Vasily Lavrentyev8, Akim Tsvigun9, Denis Turdakov4, Tatiana Shavrina10, Andrey Savchenko3,4,11, Ilya Makarov 1,2,3,4,8,12 1AIRI, 2NUST MISi S, 3HSE University, 4ISP RAS, 5Cornell University, 6École normale supérieure, 7Independent Researcher, 8ITMO University, 9KU Leuven, 10Institute of Linguistics RAS, 11Sber AI Lab, 12MIPT
Pseudocode	No	The paper does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code	Yes	In this paper, we propose a novel framework for emotion modeling in LLMs, with source code publicly available on Git Hub1. 1https://github.com/AIRI-Institute/EAI-Framework
Open Datasets	Yes	Implicit Ethics: Using the ETHICS dataset [63], we use LLM to categorize morally charged scenarios as wrong or not wrong. Explicit Ethics: Employing the Moral Choice dataset [36] with scenarios featuring two choices... Stereotype Recognition: Utilizing Stereo Set [35] to recognize stereotypes in sentences classifying them into one of three classes...
Dataset Splits	No	The paper evaluates pre-trained LLMs on existing datasets (ETHICS, Moral Choice, Stereo Set), but it does not specify any training/validation/test splits that they used for their experimental setup. It mentions accuracy is computed on 'all examples' or specific subsets (e.g., 'good' and 'bad' scenarios) for evaluation purposes, but not for splitting during model training or validation in their framework.
Hardware Specification	No	The paper states in its NeurIPS Paper Checklist that 'Experiments are mostly conducted via API and thus do not require extensive computer resources,' and does not provide specific details on CPU, GPU, or memory used for the experiments.
Software Dependencies	No	The paper specifies the versions of the LLMs used (e.g., 'gpt-3.5-turbo-0125 for GPT-3.5'), but it does not list general software dependencies or programming libraries with their version numbers (e.g., Python, PyTorch, CUDA) that are part of their framework's implementation.
Experiment Setup	Yes	We fixed model versions (Appendix B.1) for reproducibility and set the temperature to 0. Additionally, we studied result consistency over five runs and temperature influence in Appendix C. ... In our experiments, we test reasoning with and without Co T. ... We introduce the budget effect and check whether or not varying the total endowment for allocation changes the behavior of LLM both in the baseline configuration and in emotional states. ... For the experiments involving the Responder in the Ultimatum Game, we predefined different offers to verify the alignment of the acceptance rates: [0.2, 0.4, 0.6, 0.8, 0.95, 1].