SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents
Authors: Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, Maarten Sap
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We simulate the role-play interaction between LLM-based agents and humans within this task space and evaluate their performance with a holistic evaluation framework called SOTOPIA-EVAL. |
| Researcher Affiliation | Academia | Language Technologies Institute, Carnegie Mellon University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our contributions are as follows: (A) We introduce and will release SOTOPIA, a general-domain interactive environment for simulating goal-oriented social interactions. (first page also contains URL: https://sotopia.world) |
| Open Datasets | Yes | To generate high-quality scenarios with enough coverage of different types of social interactions (as shown in Figure 1), we randomly sample data from previous datasets, including Forbes et al. 2020; Sap et al. 2019; Lewis et al. 2017; Ziems et al. 2023; He et al. 2018; 2017, and use them in the prompts to inspire GPT-4. |
| Dataset Splits | No | The paper describes generating tasks/episodes for evaluation but does not specify training, validation, or test dataset splits for model development or evaluation reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU/CPU models or memory. |
| Software Dependencies | Yes | Specifically, we consider the following models for comparison: GPT-3.5 (Ouyang et al., 2022), GPT-4 (OpenAI, 2023), Llama-2-70b-chat (Touvron et al., 2023), and MPT-30b-chat (MosaicML NLP Team, 2023). |
| Experiment Setup | Yes | We set the temperature of the agents to 1 to encourage diversity of responses, and the temperature of the evaluator to 0 to ensure the stability of the evaluation. We use a fixed version of the above models to help reproducibility. |
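
The sampling setup in the last row can be written down as a small configuration sketch. This is an illustration only, not the authors' code: the names `SamplingConfig` and `build_configs` and the placeholder model identifiers are assumptions. Only the two temperature values (1 for agents, 0 for the evaluator) and the use of fixed model versions come from the quoted text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    """Hypothetical container for one model's sampling settings."""
    model: str          # pinned model identifier (placeholder, not from the paper)
    temperature: float  # sampling temperature


# Temperatures quoted in the Experiment Setup row.
AGENT_TEMPERATURE = 1.0      # encourages diversity of agent responses
EVALUATOR_TEMPERATURE = 0.0  # keeps the evaluation stable


def build_configs(agent_model: str, evaluator_model: str) -> dict:
    """Return sampling configs for the agent and evaluator roles.

    Both identifiers should name a fixed model version; the exact
    version strings are not given in the quoted text, so callers
    must supply them.
    """
    return {
        "agent": SamplingConfig(agent_model, AGENT_TEMPERATURE),
        "evaluator": SamplingConfig(evaluator_model, EVALUATOR_TEMPERATURE),
    }


# Placeholder identifiers; substitute the pinned versions actually used.
configs = build_configs("agent-model-pinned", "evaluator-model-pinned")
```

Keeping the two temperatures as named constants makes the asymmetry explicit: agents sample with diversity while the evaluator is kept as stable as possible.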