MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks
Authors: Allen Nie, Yuhui Zhang, Atharva Shailesh Amdekar, Chris Piech, Tatsunori B. Hashimoto, Tobias Gerstenberg
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We collected a dataset of stories from 24 cognitive science papers and developed a system to annotate each story with the factors they investigated. Using this dataset, we test whether large language models (LLMs) make causal and moral judgments about text-based scenarios that align with those of human participants. |
| Researcher Affiliation | Academia | ¹Computer Science, ²ICME, ³Psychology, Stanford University |
| Pseudocode | No | The paper does not contain a pseudocode block or a clearly labeled algorithm block. |
| Open Source Code | Yes | We have released the code and dataset here: https://github.com/cicl-stanford/moca (an access sketch follows this table). |
| Open Datasets | Yes | We collected a dataset of stories from 24 cognitive science papers and developed a system to annotate each story with the factors they investigated. Using this dataset, we test whether large language models (LLMs) make causal and moral judgments about text-based scenarios that align with those of human participants. ... We have released the code and dataset here: https://github.com/cicl-stanford/moca |
| Dataset Splits | No | The paper describes how responses were collected for the stories and provides details about the dataset, but it does not specify explicit training/validation/test splits for model reproduction; the stories are drawn from the existing cognitive science literature. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments. |
| Software Dependencies | No | The paper mentions various LLMs by name (e.g., RoBERTa-large, GPT2-XL, GPT3-babbage-v1, Anthropic-claude-v1, GPT3.5-turbo, GPT-4) and notes that they were accessed via API. However, it does not specify versioned software dependencies needed for reproducibility, such as the Python version or library versions (e.g., PyTorch, TensorFlow). |
| Experiment Setup | No | The paper describes experimental setups such as 'Zero-shot Alignment', 'Social Simulacra', and 'Automatic Prompt Optimization' and details how models were prompted and evaluated (e.g., '3-class comparison', 'normalize the model output probability'; a scoring sketch follows this table). However, it does not provide concrete hyperparameter values or detailed system-level training settings for the evaluated models, as these were mostly accessed via APIs. |
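
The Open Source Code and Open Datasets rows above point to https://github.com/cicl-stanford/moca. The sketch below only illustrates fetching the repository and reading an annotated-story file; the data file name, location, and JSON schema used here are assumptions rather than the repository's documented layout, so consult its README for the actual structure.

```python
# Hypothetical access sketch for the released MoCa code/data repository.
# The data path and JSON structure below are placeholders; the real layout
# is defined by https://github.com/cicl-stanford/moca and its README.
import json
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/cicl-stanford/moca"
repo_dir = Path("moca")

# Clone the repository once if it is not already present locally.
if not repo_dir.exists():
    subprocess.run(["git", "clone", REPO_URL, str(repo_dir)], check=True)

# Assumed file location for the annotated stories (hypothetical path).
data_path = repo_dir / "data" / "causal_stories.json"
if data_path.exists():
    stories = json.loads(data_path.read_text())
    print(f"Loaded {len(stories)} annotated stories")
else:
    print("Assumed path not found; see the repository README for the actual data files.")
```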
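
The experiment-setup row quotes 'normalize the model output probability'. The following is a minimal sketch of that kind of zero-shot answer scoring, not the paper's exact protocol: the story, question wording, and answer strings are invented for illustration, and GPT2-XL (one of the models the paper lists) stands in for the API-accessed models. Each answer choice is scored by the log-probability the model assigns to it after the prompt, and the scores are renormalized over the fixed set of choices.

```python
# Minimal sketch of zero-shot scoring by normalized output probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2-xl"  # swap in "gpt2" for a quick local test
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Sum of log-probabilities the model assigns to the answer tokens, given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    full_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    prompt_len = prompt_ids.shape[1]
    total = 0.0
    for i in range(answer_ids.shape[1]):
        token_id = full_ids[0, prompt_len + i]
        # Logits at position t predict the token at position t + 1.
        total += log_probs[0, prompt_len + i - 1, token_id].item()
    return total

# Invented story and question, in the spirit of the causal judgment scenarios.
story = ("Billy and Suzy each threw a rock at a bottle. "
         "Billy's rock hit the bottle first and it shattered.")
question = f"{story}\nQuestion: Did Billy cause the bottle to shatter?\nAnswer:"
choices = [" Yes", " No"]

scores = torch.tensor([answer_logprob(question, c) for c in choices])
probs = torch.softmax(scores, dim=0)  # normalize over the answer choices
for choice, p in zip(choices, probs):
    print(f"{choice.strip()}: {p:.3f}")
```

Normalizing over a fixed set of answer strings, rather than sampling free-form text, makes it straightforward to compare the model's choice probabilities against the distribution of human judgments for the same story.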