MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks
Authors: Allen Nie, Yuhui Zhang, Atharva Shailesh Amdekar, Chris Piech, Tatsunori B. Hashimoto, Tobias Gerstenberg
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We collected a dataset of stories from 24 cognitive science papers and developed a system to annotate each story with the factors they investigated. Using this dataset, we test whether large language models (LLMs) make causal and moral judgments about text-based scenarios that align with those of human participants. |
| Researcher Affiliation | Academia | ¹Computer Science, ²ICME, ³Psychology, Stanford University |
| Pseudocode | No | The paper does not contain a pseudocode block or a clearly labeled algorithm block. |
| Open Source Code | Yes | We have released the code and dataset here: https://github.com/cicl-stanford/moca (an access sketch follows this table). |
| Open Datasets | Yes | We collected a dataset of stories from 24 cognitive science papers and developed a system to annotate each story with the factors they investigated. Using this dataset, we test whether large language models (LLMs) make causal and moral judgments about text-based scenarios that align with those of human participants. ... We have released the code and dataset here: https://github.com/cicl-stanford/moca |
| Dataset Splits | No | The paper describes how responses were collected for the stories and provides details about the dataset, but it does not specify explicit training/validation/test splits for model reproduction; the stories are drawn from the existing cognitive science literature. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments. |
| Software Dependencies | No | The paper mentions various LLMs by name (e.g., RoBERTa-large, GPT2-XL, GPT3-babbage-v1, Anthropic-claude-v1, GPT3.5-turbo, GPT-4) and notes that they were accessed via API. However, it does not specify versioned software dependencies needed for reproducibility, such as the Python version or library versions (e.g., PyTorch, TensorFlow). |
| Experiment Setup | No | The paper describes experimental setups such as 'Zero-shot Alignment', 'Social Simulacra', and 'Automatic Prompt Optimization' and details how models were prompted and evaluated (e.g., '3-class comparison', 'normalize the model output probability'; a scoring sketch follows this table). However, it does not provide concrete hyperparameter values or detailed system-level training settings for the evaluated models, as these were mostly accessed via APIs. |
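
The Open Source Code and Open Datasets rows above point to https://github.com/cicl-stanford/moca. The sketch below only illustrates fetching the repository and reading an annotated-story file; the data file name, location, and JSON schema used here are assumptions rather than the repository's documented layout, so consult its README for the actual structure.

```python
# Hypothetical access sketch for the released MoCa code/data repository.
# The data path and JSON structure below are placeholders; the real layout
# is defined by https://github.com/cicl-stanford/moca and its README.
import json
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/cicl-stanford/moca"
repo_dir = Path("moca")

# Clone the repository once if it is not already present locally.
if not repo_dir.exists():
    subprocess.run(["git", "clone", REPO_URL, str(repo_dir)], check=True)

# Assumed file location for the annotated stories (hypothetical path).
data_path = repo_dir / "data" / "causal_stories.json"
if data_path.exists():
    stories = json.loads(data_path.read_text())
    print(f"Loaded {len(stories)} annotated stories")
else:
    print("Assumed path not found; see the repository README for the actual data files.")
```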
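
The experiment-setup row quotes 'normalize the model output probability'. The following is a minimal sketch of that kind of zero-shot answer scoring, not the paper's exact protocol: the story, question wording, and answer strings are invented for illustration, and GPT2-XL (one of the models the paper lists) stands in for the API-accessed models. Each answer choice is scored by the log-probability the model assigns to it after the prompt, and the scores are renormalized over the fixed set of choices.

```python
# Minimal sketch of zero-shot scoring by normalized output probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2-xl"  # swap in "gpt2" for a quick local test
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Sum of log-probabilities the model assigns to the answer tokens, given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    full_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    prompt_len = prompt_ids.shape[1]
    total = 0.0
    for i in range(answer_ids.shape[1]):
        token_id = full_ids[0, prompt_len + i]
        # Logits at position t predict the token at position t + 1.
        total += log_probs[0, prompt_len + i - 1, token_id].item()
    return total

# Invented story and question, in the spirit of the causal judgment scenarios.
story = ("Billy and Suzy each threw a rock at a bottle. "
         "Billy's rock hit the bottle first and it shattered.")
question = f"{story}\nQuestion: Did Billy cause the bottle to shatter?\nAnswer:"
choices = [" Yes", " No"]

scores = torch.tensor([answer_logprob(question, c) for c in choices])
probs = torch.softmax(scores, dim=0)  # normalize over the answer choices
for choice, p in zip(choices, probs):
    print(f"{choice.strip()}: {p:.3f}")
```

Normalizing over a fixed set of answer strings, rather than sampling free-form text, makes it straightforward to compare the model's choice probabilities against the distribution of human judgments for the same story.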