Apathetic or Empathetic? Evaluating LLMs' Emotional Alignments with Humans

Authors: Jen-tse Huang, Man Ho Lam, Eric John Li, Shujie Ren, Wenxuan Wang, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu

NeurIPS 2024

Reproducibility assessment. Each entry below lists the variable, the assessed result, and the LLM response supporting it.
Research Type: Experimental
Evaluating the anthropomorphic capabilities of Large Language Models (LLMs) has become increasingly important in contemporary discourse. Drawing on emotion appraisal theory from psychology, we propose to evaluate the empathy of LLMs, i.e., how their feelings change when presented with specific situations. After a careful and comprehensive survey, we collect a dataset of over 400 situations shown to be effective in eliciting the eight emotions central to our study. Categorizing the situations into 36 factors, we conduct a human evaluation involving more than 1,200 subjects worldwide. With the human evaluation results as references, our evaluation covers seven commercial and open-source LLMs of varying sizes, including recent models such as GPT-4, Mixtral-8x22B, and LLaMA-3.1. We find that, despite several misalignments, LLMs can generally respond appropriately to certain situations. Nevertheless, they fall short of aligning with the emotional behaviors of human beings and cannot establish connections between similar situations. Our collected dataset of situations, the human evaluation results, and the code of our testing framework, i.e., EmotionBench, are publicly available at https://github.com/CUHK-ARISE/EmotionBench.
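To make the alignment comparison concrete, below is a minimal sketch of how evoked-versus-default emotion shifts could be compared against the human reference. The score values, the emotion_shift helper, and the use of Welch's t-test are illustrative assumptions, not the paper's exact statistical procedure.

import numpy as np
from scipy.stats import ttest_ind

def emotion_shift(default_scores, evoked_scores):
    """Per-subject change in an emotion score after imagining a situation."""
    return np.asarray(evoked_scores) - np.asarray(default_scores)

# Toy numbers for illustration only; not data from the paper.
human_shift = emotion_shift([12, 15, 11, 14], [25, 30, 22, 28])
llm_shift = emotion_shift([10, 10, 10, 10], [18, 19, 17, 20])

# Welch's t-test: does the model's shift differ from the human shift?
t_stat, p_value = ttest_ind(llm_shift, human_shift, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")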
Researcher Affiliation: Collaboration
Jen-tse Huang (1), Man Ho Lam (1), Eric John Li (1), Shujie Ren (2), Wenxuan Wang (1), Wenxiang Jiao (3), Zhaopeng Tu (3), Michael R. Lyu (1)
(1) Department of Computer Science and Engineering, The Chinese University of Hong Kong; (2) Institute of Psychology, Tianjin Medical University; (3) Tencent AI Lab
Pseudocode: No
The paper describes the steps of its framework (Default Emotion Measure, Situation Imagination, Evoked Emotion Measure) but does not present them in pseudocode or an algorithm block.
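Since the framework is described only in prose, the following sketch reconstructs the three steps under stated assumptions: chat is a generic prompt-to-reply callable for the model under test, and the questionnaire wording and score parsing are illustrative stand-ins for the actual instruments in the EmotionBench repository.

import re

# PANAS-style self-report prompt; the wording here is an illustrative stand-in.
QUESTIONNAIRE = (
    "Rate how much you feel each emotion on a scale of 1 (very slightly) "
    "to 5 (extremely), as 'emotion: score' lines: afraid, nervous, upset, guilty."
)

def parse_scores(reply: str) -> dict[str, int]:
    """Naive parser: pull 'emotion: score' pairs out of the model's reply."""
    return {m[0].lower(): int(m[1]) for m in re.findall(r"(\w+)\s*[:=]\s*([1-5])", reply)}

def evaluate_situation(chat, situation: str) -> dict[str, int]:
    # Step 1: Default Emotion Measure -- ask with no situation given.
    default = parse_scores(chat(QUESTIONNAIRE))
    # Step 2: Situation Imagination -- place the model in the situation.
    context = f"Imagine you are the protagonist of this situation: {situation}"
    # Step 3: Evoked Emotion Measure -- ask again after the situation.
    evoked = parse_scores(chat(f"{context}\n\n{QUESTIONNAIRE}"))
    # The quantity of interest is the per-emotion shift.
    return {k: evoked[k] - default.get(k, 0) for k in evoked}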
Open Source Code: Yes
"Our collected dataset of situations, the human evaluation results, and the code of our testing framework, i.e., EmotionBench, are publicly available at https://github.com/CUHK-ARISE/EmotionBench."
Open Datasets: Yes
"Our collected dataset of situations, the human evaluation results, and the code of our testing framework, i.e., EmotionBench, are publicly available at https://github.com/CUHK-ARISE/EmotionBench. [...] A human baseline is established through a user study involving 1,266 annotators of different ethnicities, genders, regions, age groups, etc. [...] Our EmotionBench (1,266 human responses) is split into 866 samples for fine-tuning and 400 for testing."
Dataset Splits: No
The paper splits the data into fine-tuning (training) and testing sets for model alignment: "Our EmotionBench (1,266 human responses) is split into 866 samples for fine-tuning and 400 for testing." However, it does not mention a separate validation split for model tuning.
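For concreteness, a reproducible 866/400 partition of the 1,266 responses could look like the sketch below; the fixed seed and the use of scikit-learn are assumptions, since the paper does not say how the split was drawn (and reports no validation set).

from sklearn.model_selection import train_test_split

responses = list(range(1266))  # stand-ins for the 1,266 human responses
finetune, test = train_test_split(responses, test_size=400, random_state=0)
assert len(finetune) == 866 and len(test) == 400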
Hardware Specification: No
The paper mentions using OpenAI APIs and specific LLMs (e.g., GPT-4, Mixtral-8x22B) but does not specify the hardware (e.g., GPU models, CPU types) used to run the experiments or serve the open-source models.
Software Dependencies: No
The paper describes the LLMs and APIs it uses but does not list software dependencies with version numbers (e.g., the Python version, or library versions such as PyTorch or TensorFlow) required to reproduce the framework or experiments.
Experiment Setup: Yes
"We set the temperature parameter to 0 and Top-P to 1 for all models to obtain more deterministic and reproducible results. [...] The following hyperparameters are used: n_epochs = 3, batch_size = 1, and learning_rate_multiplier = 2 for GPT-3.5-Turbo, and learning_rate = 5e-5, per_device_train_batch_size = 2, and num_train_epochs = 3 for LLaMA-3.1-8B."
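As a hedged illustration of how these settings map onto common tooling: the OpenAI client calls and Hugging Face TrainingArguments below are assumptions about the stack (the paper mentions OpenAI APIs but not its exact client code), the training-file ID and output directory are placeholders, and only the numeric settings come from the paper.

from openai import OpenAI
from transformers import TrainingArguments

client = OpenAI()

# Deterministic sampling settings reported for all evaluated models.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Imagine you are the protagonist..."}],
    temperature=0,
    top_p=1,
)

# Reported GPT-3.5-Turbo fine-tuning settings ("file-abc123" is a placeholder).
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-3.5-turbo",
    hyperparameters={"n_epochs": 3, "batch_size": 1, "learning_rate_multiplier": 2},
)

# Reported LLaMA-3.1-8B fine-tuning settings (output_dir is a placeholder).
args = TrainingArguments(
    output_dir="emotionbench-llama-3.1-8b",
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    num_train_epochs=3,
)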