WikiWhy: Answering and Explaining Cause-and-Effect Questions

Authors: Matthew Ho, Aditya Sharma, Justin Chang, Michael Saxon, Sharon Levy, Yujie Lu, William Yang Wang

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments in explanation generation and human evaluation demonstrate that state-of-the-art generative models struggle with producing satisfying explanations for WIKIWHY cause-effect relations. Our experiments also demonstrate how our proposed task might be used to diagnose a lack of understanding in certain relations. Our key contributions are thus: ... We perform experiments on state-of-the-art, generative models to investigate various settings and establish baseline results with sizable room for improvement.
Researcher Affiliation | Academia | Matthew Ho, Aditya Sharma, Justin Chang, Michael Saxon, Sharon Levy, Yujie Lu, William Yang Wang. Department of Computer Science, University of California, Santa Barbara, USA. {msho, aditya_sharma, justin_chang}@ucsb.edu, {saxon, sharonlevy, yujielu}@ucsb.edu, william@cs.ucsb.edu
Pseudocode | No | The paper describes the data collection process and experimental setups in detail but does not include any pseudocode or algorithm blocks for the proposed methods.
Open Source Code | Yes | We publicly release our dataset and codebase at https://github.com/matt-seb-ho/WikiWhy containing the model tuning procedures, settings, few-shot prompts, and evaluation script.
Open Datasets | Yes | We publicly release our dataset and codebase at https://github.com/matt-seb-ho/WikiWhy containing the model tuning procedures, settings, few-shot prompts, and evaluation script.
Dataset Splits | Yes | We also fine-tune a Fusion-in-Decoder (FiD) (Izacard & Grave, 2020) model (80-10-10 split; default configurations)... We train GPT-2 for ten epochs using the training split (80% of the data). (See the split sketch after this table.)
Hardware Specification | No | The paper details the models used (GPT-2, GPT-3, RoBERTa, BigBird, FiD) and their training configurations, but it does not specify the underlying hardware (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions software tools and libraries like Pyserini, GPT-3, and GPT-2, and describes the Adam optimizer parameters, but it does not provide specific version numbers for the programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other key software dependencies.
Experiment Setup | Yes | We train GPT-2 for ten epochs using the training split (80% of the data) and Adam (Kingma & Ba, 2014) optimizer with standard hyperparameters (learning rate: γ = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8, decay: λ = 0). For this tuned model we introduce special delimiter tokens <cause>, <effect>, and <explanation> in addition to the beginning and end tokens <bos> and <eos>. (See the setup sketch after this table.)
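
For reference, the 80-10-10 split quoted in the Dataset Splits row can be reproduced in a few lines of Python. This is a minimal sketch only: the paper reports the split ratios but not its shuffling strategy or random seed, so both are assumptions here, and the placeholder records merely stand in for WikiWhy entries.

```python
import random

def split_80_10_10(examples, seed=0):
    """Shuffle and split a list of examples into 80/10/10 train/dev/test portions.

    Sketch under assumptions: the shuffling strategy and seed are not
    reported in the paper.
    """
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # fixed seed for a reproducible split
    n = len(examples)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:]
    return train, dev, test

# Usage with hypothetical records shaped like WikiWhy entries.
records = [{"cause": "c", "effect": "e", "explanation": "x"}] * 100
train, dev, test = split_80_10_10(records)
```

Fixing the seed keeps the split deterministic across runs, which is what makes the 80-10-10 partition reportable in the first place.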
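The GPT-2 configuration quoted in the Experiment Setup row can likewise be illustrated with Hugging Face Transformers and PyTorch. Only the Adam hyperparameters, the ten-epoch count, and the special delimiter tokens come from the paper; the input format, the single-example batch, and the training loop below are illustrative assumptions.

```python
# Minimal sketch of the quoted GPT-2 tuning setup; not the authors' released code.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Delimiter tokens described in the paper, plus beginning and end tokens.
tokenizer.add_special_tokens({
    "bos_token": "<bos>",
    "eos_token": "<eos>",
    "additional_special_tokens": ["<cause>", "<effect>", "<explanation>"],
})
model.resize_token_embeddings(len(tokenizer))

# Adam with the hyperparameters reported in the paper.
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0
)

# Hypothetical training example in the delimited format implied by the tokens.
text = ("<bos> <cause> heavy monsoon rainfall <effect> the river flooded "
        "<explanation> the rainfall exceeded the river's drainage capacity <eos>")
batch = tokenizer(text, return_tensors="pt")

model.train()
for epoch in range(10):  # "ten epochs" per the paper; one example shown for brevity
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice the loop would iterate over batches drawn from the 80% training split rather than a single hard-coded example.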