PIQA: Reasoning about Physical Commonsense in Natural Language

Authors: Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA. Though humans find the dataset easy (95% accuracy), large pretrained models struggle (~75%). We present our results in Table 1. In this section, we test the performance of state-of-the-art natural language understanding models on our dataset, PIQA.
Researcher Affiliation | Collaboration | Yonatan Bisk (1,2,3,4), Rowan Zellers (1,4), Ronan Le Bras (1), Jianfeng Gao (2), Yejin Choi (1,4). Affiliations: 1 Allen Institute for Artificial Intelligence; 2 Microsoft Research AI; 3 Carnegie Mellon University; 4 Paul G. Allen School for Computer Science and Engineering, University of Washington.
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a project website (http://yonatanbisk.com/piqa) but does not explicitly state that the source code for its methodology is released, nor does it link to a code repository.
Open Datasets | Yes | In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA. We introduce a new dataset, PIQA, for benchmarking progress in physical commonsense understanding. The paper's abstract also includes the URL: http://yonatanbisk.com/piqa
Dataset Splits | Yes | In total our dataset is comprised of over 16,000 training QA pairs with an additional 2K and 3K held out for development and testing, respectively. We follow best practices in using a grid search over learning rates, batch sizes, and the number of training epochs for each model, and report the best-scoring configuration as was found on the validation set. (A hedged sketch of loading these splits appears after this table.)
Hardware Specification | No | Computations on beaker.org were supported in part by Google Cloud. This mentions a cloud service but does not specify any particular hardware components such as GPU or CPU models.
Software Dependencies | No | For all models and experiments, we used the transformers library and truncated examples at 150 tokens... Spacy, average 7.8 words. The paper mentions the transformers and Spacy libraries but does not provide specific version numbers for either.
Experiment Setup | Yes | For all models and experiments, we used the transformers library and truncated examples at 150 tokens, which affects 1% of the data. We follow best practices in using a grid search over learning rates, batch sizes, and the number of training epochs for each model, and report the best-scoring configuration as was found on the validation set.
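The Dataset Splits row above is concrete enough to sketch in code. Below is a minimal loading sketch, not an official script: it assumes the dataset is mirrored on the Hugging Face Hub under the identifier "piqa" (the official release is the project website http://yonatanbisk.com/piqa), and the split sizes in the comments are the figures quoted from the paper, not values re-verified here.

```python
# Minimal sketch: load the PIQA splits and check their sizes.
# Assumes a Hugging Face Hub mirror under the id "piqa"; the official
# release lives at http://yonatanbisk.com/piqa.
from datasets import load_dataset

piqa = load_dataset("piqa")

for split in ("train", "validation", "test"):
    print(split, len(piqa[split]))

# Per the paper: over 16,000 training QA pairs, ~2K for development
# (the "validation" split here) and ~3K held out for testing; test
# labels are withheld for the leaderboard.
```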
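Likewise, the Experiment Setup row can be read as a small hyperparameter sweep. The sketch below is an illustration under stated assumptions, not the authors' script: the grid values, the roberta-large tokenizer, and the train_and_eval helper are all hypothetical; the only details taken from the paper are the 150-token truncation and the grid search over learning rate, batch size, and epochs with the best configuration chosen on the validation set.

```python
# Illustrative sketch of the reported setup: truncate inputs at 150 tokens
# and grid-search learning rate, batch size, and epochs on the dev set.
import itertools

from transformers import AutoTokenizer

# The paper evaluates several pretrained models; roberta-large is used here
# only as an illustrative choice.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")

def encode(goal: str, solution: str) -> dict:
    # Truncate each (goal, solution) pair at 150 tokens, as stated in the paper.
    return tokenizer(goal, solution, truncation=True, max_length=150)

def train_and_eval(lr: float, batch_size: int, epochs: int) -> float:
    """Hypothetical helper: fine-tune a multiple-choice model with these
    settings and return validation accuracy. Left unimplemented here."""
    raise NotImplementedError

# Illustrative grid; the paper does not list the exact values searched.
grid = itertools.product(
    [1e-5, 2e-5, 3e-5],  # learning rates
    [16, 32],            # batch sizes
    [3, 5, 10],          # training epochs
)

best = None
for lr, batch_size, epochs in grid:
    dev_acc = train_and_eval(lr=lr, batch_size=batch_size, epochs=epochs)
    if best is None or dev_acc > best[0]:
        best = (dev_acc, lr, batch_size, epochs)

print("Best (dev accuracy, lr, batch size, epochs):", best)
```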