WikiWhy: Answering and Explaining Cause-and-Effect Questions
Authors: Matthew Ho, Aditya Sharma, Justin Chang, Michael Saxon, Sharon Levy, Yujie Lu, William Yang Wang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments in explanation generation and human evaluation demonstrate that state-of-the-art generative models struggle with producing satisfying explanations for WIKIWHY cause-effect relations. Our experiments also demonstrate how our proposed task might be used to diagnose a lack of understanding in certain relations. Our key contributions are thus: ... We perform experiments on state-of-the-art, generative models to investigate various settings and establish baseline results with sizable room for improvement. |
| Researcher Affiliation | Academia | Matthew Ho, Aditya Sharma, Justin Chang, Michael Saxon, Sharon Levy, Yujie Lu, William Yang Wang, Department of Computer Science, University of California, Santa Barbara, USA; {msho, aditya sharma, justin chang}@ucsb.edu, {saxon, sharonlevy, yujielu}@ucsb.edu, william@cs.ucsb.edu |
| Pseudocode | No | The paper describes the data collection process and experimental setups in detail but does not include any pseudocode or algorithm blocks for the proposed methods. |
| Open Source Code | Yes | We publically release our dataset and codebase at https://github.com/matt-seb-ho/WikiWhy containing the model tuning procedures, settings, few-shot prompts, and evaluation script. |
| Open Datasets | Yes | We publically release our dataset and codebase at https://github.com/matt-seb-ho/WikiWhy containing the model tuning procedures, settings, few-shot prompts, and evaluation script. |
| Dataset Splits | Yes | We also fine-tune a Fusion-in-Decoder (FiD) (Izacard & Grave, 2020) model (80-10-10 split; default configurations)... We train GPT-2 for ten epochs using the training split (80% of the data). A hedged sketch of this 80-10-10 split follows the table. |
| Hardware Specification | No | The paper details the models used (GPT-2, GPT-3, RoBERTa, BigBird, FiD) and their training configurations, but it does not specify the underlying hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software tools and libraries like Pyserini, GPT-3, and GPT-2, and describes the Adam optimizer parameters, but it does not provide specific version numbers for the programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other key software dependencies. |
| Experiment Setup | Yes | We train GPT-2 for ten epochs using the training split (80% of the data) and Adam (Kingma & Ba, 2014) optimizer with standard hyperparameters (learning rate: γ = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8, decay: λ = 0). For this tuned model we introduce special delimiter tokens <cause>, <effect>, and <explanation> in addition to the beginning and end tokens <bos> and <eos>. A hedged sketch of this setup is shown below. |
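
For reference, the snippet below is a minimal sketch of the 80-10-10 train/dev/test split reported in the Dataset Splits row. The file name `wikiwhy.json`, the JSON-list layout, and the seed are illustrative assumptions, not details taken from the released WikiWhy codebase.

```python
# Hedged sketch of an 80-10-10 train/dev/test split, as reported for the FiD
# fine-tuning setup. File name, data layout, and seed are assumptions.
import json
import random

def split_wikiwhy(path="wikiwhy.json", seed=0):
    """Shuffle the examples and split them 80/10/10 into train/dev/test."""
    with open(path) as f:
        examples = json.load(f)  # assumed: a JSON list of cause-effect records

    random.Random(seed).shuffle(examples)
    n = len(examples)
    train_end, dev_end = int(0.8 * n), int(0.9 * n)
    return {
        "train": examples[:train_end],
        "dev": examples[train_end:dev_end],
        "test": examples[dev_end:],
    }
```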
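
The second sketch mirrors the GPT-2 configuration quoted in the Experiment Setup row: the special delimiter tokens and Adam with the stated hyperparameters. Only the token names and optimizer settings come from the paper; the base checkpoint (`gpt2`) and the input formatting are assumptions.

```python
# Hedged sketch of the GPT-2 fine-tuning setup described in the paper:
# special delimiter tokens plus Adam(lr=0.001, betas=(0.9, 0.999), eps=1e-8,
# weight_decay=0). Base checkpoint and input format are assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({
    "bos_token": "<bos>",
    "eos_token": "<eos>",
    "additional_special_tokens": ["<cause>", "<effect>", "<explanation>"],
})

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # give the new tokens embeddings

optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0
)

# Assumed input format: cause and effect delimited, explanation as the target.
example = "<bos><cause> heavy rainfall <effect> flooding <explanation> ... <eos>"
inputs = tokenizer(example, return_tensors="pt")
loss = model(**inputs, labels=inputs["input_ids"]).loss  # standard LM objective
loss.backward()
optimizer.step()  # one illustrative step; the paper trains for ten epochs
```

Resizing the embedding matrix after `add_special_tokens` is what gives the new delimiter tokens trainable embeddings before fine-tuning begins.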