WikiWhy: Answering and Explaining Cause-and-Effect Questions

Authors: Matthew Ho, Aditya Sharma, Justin Chang, Michael Saxon, Sharon Levy, Yujie Lu, William Yang Wang

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments in explanation generation and human evaluation demonstrate that state-of-the-art generative models struggle with producing satisfying explanations for WIKIWHY cause-effect relations. Our experiments also demonstrate how our proposed task might be used to diagnose a lack of understanding in certain relations. Our key contributions are thus: ... We perform experiments on state-of-the-art, generative models to investigate various settings and establish baseline results with sizable room for improvement.
Researcher Affiliation | Academia | Matthew Ho, Aditya Sharma, Justin Chang, Michael Saxon, Sharon Levy, Yujie Lu, William Yang Wang. Department of Computer Science, University of California, Santa Barbara, USA. {msho, aditya_sharma, justin_chang}@ucsb.edu, {saxon, sharonlevy, yujielu}@ucsb.edu, william@cs.ucsb.edu
Pseudocode | No | The paper describes the data collection process and experimental setups in detail but does not include any pseudocode or algorithm blocks for the proposed methods.
Open Source Code | Yes | We publicly release our dataset and codebase at https://github.com/matt-seb-ho/WikiWhy containing the model tuning procedures, settings, few-shot prompts, and evaluation script.
Open Datasets | Yes | We publicly release our dataset and codebase at https://github.com/matt-seb-ho/WikiWhy containing the model tuning procedures, settings, few-shot prompts, and evaluation script.
Dataset Splits | Yes | We also fine-tune a Fusion-in-Decoder (FiD) (Izacard & Grave, 2020) model (80-10-10 split; default configurations)... We train GPT-2 for ten epochs using the training split (80% of the data). (See the split sketch after this table.)
Hardware Specification | No | The paper details the models used (GPT-2, GPT-3, RoBERTa, BigBird, FiD) and their training configurations, but it does not specify the underlying hardware (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions software tools and libraries like Pyserini, GPT-3, and GPT-2, and describes the Adam optimizer parameters, but it does not provide specific version numbers for the programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other key software dependencies.
Experiment Setup | Yes | We train GPT-2 for ten epochs using the training split (80% of the data) and Adam (Kingma & Ba, 2014) optimizer with standard hyperparameters (learning rate: γ = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8, decay: λ = 0). For this tuned model we introduce special delimiter tokens <cause>, <effect>, and <explanation> in addition to the beginning and end tokens <bos> and <eos>. (See the setup sketch after this table.)
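
For reference, the 80-10-10 split quoted in the Dataset Splits row can be reproduced in a few lines of Python. This is a minimal sketch only: the paper reports the split ratios but not its shuffling strategy or random seed, so both are assumptions here, and the placeholder records merely stand in for WikiWhy entries.

```python
import random

def split_80_10_10(examples, seed=0):
    """Shuffle and split a list of examples into 80/10/10 train/dev/test portions.

    Sketch under assumptions: the shuffling strategy and seed are not
    reported in the paper.
    """
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # fixed seed for a reproducible split
    n = len(examples)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:]
    return train, dev, test

# Usage with hypothetical records shaped like WikiWhy entries.
records = [{"cause": "c", "effect": "e", "explanation": "x"}] * 100
train, dev, test = split_80_10_10(records)
```

Fixing the seed keeps the split deterministic across runs, which is what makes the 80-10-10 partition reportable in the first place.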
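The GPT-2 configuration quoted in the Experiment Setup row can likewise be illustrated with Hugging Face Transformers and PyTorch. Only the Adam hyperparameters, the ten-epoch count, and the special delimiter tokens come from the paper; the input format, the single-example batch, and the training loop below are illustrative assumptions.

```python
# Minimal sketch of the quoted GPT-2 tuning setup; not the authors' released code.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Delimiter tokens described in the paper, plus beginning and end tokens.
tokenizer.add_special_tokens({
    "bos_token": "<bos>",
    "eos_token": "<eos>",
    "additional_special_tokens": ["<cause>", "<effect>", "<explanation>"],
})
model.resize_token_embeddings(len(tokenizer))

# Adam with the hyperparameters reported in the paper.
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0
)

# Hypothetical training example in the delimited format implied by the tokens.
text = ("<bos> <cause> heavy monsoon rainfall <effect> the river flooded "
        "<explanation> the rainfall exceeded the river's drainage capacity <eos>")
batch = tokenizer(text, return_tensors="pt")

model.train()
for epoch in range(10):  # "ten epochs" per the paper; one example shown for brevity
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice the loop would iterate over batches drawn from the 80% training split rather than a single hard-coded example.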