Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents
Authors: Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, Maarten Sap
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We simulate the role-play interaction between LLM-based agents and humans within this task space and evaluate their performance with a holistic evaluation framework called SOTOPIA-EVAL. |
| Researcher Affiliation | Academia | Language Technologies Institute, Carnegie Mellon University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our contributions are as follows: (A) We introduce and will release SOTOPIA, a general-domain interactive environment for simulating goal-oriented social interactions. (first page also contains URL: https://sotopia.world) |
| Open Datasets | Yes | To generate high-quality scenarios with enough coverage of different types of social interactions (as shown in Figure 1), we randomly sample data from previous datasets, including Forbes et al. 2020; Sap et al. 2019; Lewis et al. 2017; Ziems et al. 2023; He et al. 2018; 2017, and use them in the prompts to inspire GPT-4. |
| Dataset Splits | No | The paper describes generating tasks/episodes for evaluation but does not specify training, validation, or test dataset splits for model development or evaluation reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU/CPU models or memory. |
| Software Dependencies | Yes | Specifically, we consider the following models for comparison: GPT-3.5 (Ouyang et al., 2022), GPT-4 (OpenAI, 2023), Llama-2-70b-chat (Touvron et al., 2023), and MPT-30b-chat (MosaicML NLP Team, 2023). |
| Experiment Setup | Yes | We set the temperature of the agents to 1 to encourage diversity of responses, and the temperature of the evaluator to 0 to ensure the stability of the evaluation. We use a fixed version of the above models to help reproducibility. |
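
The experiment setup quoted in the last row can be written down as a small configuration sketch. This is only an illustration and not code from the SOTOPIA release: the `SamplingConfig` class, the `build_setup` helper, and the placeholder model identifiers are hypothetical. The two temperature values are the only details taken from the quoted text; the exact pinned model versions are not given here, so they stay as placeholders.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SamplingConfig:
    """Hypothetical container for one role's decoding settings."""
    role: str
    model_version: str  # placeholder: the pinned version is not stated in this summary
    temperature: float


def build_setup(agent_model: str, evaluator_model: str) -> Dict[str, SamplingConfig]:
    """Return settings that mirror the quoted setup.

    Agents sample at temperature 1 to encourage diverse responses;
    the evaluator uses temperature 0 to keep its judgments stable.
    """
    return {
        "agent": SamplingConfig("agent", agent_model, temperature=1.0),
        "evaluator": SamplingConfig("evaluator", evaluator_model, temperature=0.0),
    }


# Placeholder identifiers (assumptions); substitute the fixed versions actually used.
setup = build_setup("PINNED_AGENT_MODEL", "PINNED_EVALUATOR_MODEL")
for cfg in setup.values():
    print(f"{cfg.role}: model={cfg.model_version}, temperature={cfg.temperature}")
```

Freezing the dataclass keeps the settings immutable once a run starts, which fits the stated goal of using fixed model versions for reproducibility.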