ROCK: Causal Inference Principles for Reasoning about Commonsense Causality

Authors: Jiayao Zhang, Hongming Zhang, Weijie Su, Dan Roth

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We put the ROCK framework into action; our findings reveal that although temporality is essential for CCR, without balancing covariates it is prone to spurious correlations." (Section 5, Empirical Studies) "We evaluate the ROCK framework on the Choice of Plausible Alternatives dataset (COPA, Gordon et al., 2012) and a self-constructed dataset of 153 instances using the first dimension (cause-and-effect) of GLUCOSE (GLUCOSE-D1, Mostafazadeh et al., 2020)." (Section 5.1, Evaluation Datasets)
Researcher Affiliation | Collaboration | (1) Cognitive Computation Group, University of Pennsylvania, USA; (2) Department of Statistics and Data Science, University of Pennsylvania, USA; (3) Tencent AI Lab Seattle, USA; (4) Amazon AWS AI Labs, USA.
Pseudocode | No | The paper describes the framework and its components in text and with a diagram (Figure 2), but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | "Code for the ROCK and for reproducing all results in this paper is available at github.com:zjiayao/ccr_rock.git."
Open Datasets | Yes | "We evaluate the ROCK framework on the Choice of Plausible Alternatives dataset (COPA, Gordon et al., 2012) and a self-constructed dataset of 153 instances using the first dimension (cause-and-effect) of GLUCOSE (GLUCOSE-D1, Mostafazadeh et al., 2020)."
Dataset Splits | Yes | "We evaluate the development set of 100 instances (COPA-DEV) and the test set of 500 instances (COPA-TEST). To construct GLUCOSE-D1, we take the test set, set the cause as the premise and the effect and another candidate event as the two choices, then follow the same procedure."
Hardware Specification | No | The paper mentions models like GPT-J and RoBERTa, but does not provide specific hardware details such as GPU/CPU models, memory, or cloud-computing instance types used for the experiments.
Software Dependencies | No | The paper mentions using the GPT-J model, Hugging Face Transformers, AllenNLP's BERT SRL model, and PolyJuice, but does not provide version numbers for these software components or for the programming languages used.
Experiment Setup | Yes | "We set max length of returned sequences to be 30, temperature to be 0.9." ... "We choose a batch size of 500 and a learning rate of 5e-5, and train the model to convergence, which was around 135000 iterations when the loss converges to 1.37 from 2.02." ... "for some fixed threshold ϵ and p ∈ {1, 2}, we will use following estimating equation for the Lp-balanced score..."
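The hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration sketch. This is a minimal, hypothetical illustration: only the numeric values (max length 30, temperature 0.9, batch size 500, learning rate 5e-5) come from the paper; the dictionary names and the sampling flag are assumptions, not part of the ROCK codebase.

```python
# Generation settings quoted from the paper's setup; the do_sample flag is
# an assumption (temperature is only meaningful when sampling is enabled).
generation_config = {
    "max_length": 30,    # max length of returned sequences
    "temperature": 0.9,  # sampling temperature
    "do_sample": True,   # assumed: enables temperature-based sampling
}

# Training settings quoted from the paper's setup.
training_config = {
    "batch_size": 500,
    "learning_rate": 5e-5,
    # Trained to convergence: roughly 135000 iterations, with the loss
    # falling from 2.02 to 1.37 according to the paper.
}

print(generation_config["temperature"], training_config["learning_rate"])
```

A dictionary like this could be passed as keyword arguments to a Hugging Face `model.generate(**generation_config)` call, but the exact invocation used by the authors is not specified in the excerpt.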