PIQA: Reasoning about Physical Commonsense in Natural Language
Authors: Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi
AAAI 2020, pp. 7432-7439 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA. Though humans find the dataset easy (95% accuracy), large pretrained models struggle (~75%). We present our results in Table 1. In this section, we test the performance of state-of-the-art natural language understanding models on our dataset, PIQA. |
| Researcher Affiliation | Collaboration | Yonatan Bisk,1,2,3,4 Rowan Zellers,1,4 Ronan Le Bras,1 Jianfeng Gao,2 Yejin Choi1,4 1Allen Institute for Artificial Intelligence 2Microsoft Research AI 3Carnegie Mellon University 4Paul G. Allen School for Computer Science and Engineering, University of Washington |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a project website (http://yonatanbisk.com/piqa) but does not explicitly state that the source code for their methodology is released or provide a direct link to a code repository for it. |
| Open Datasets | Yes | In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA. We introduce a new dataset, PIQA, for benchmarking progress in physical commonsense understanding. The paper's abstract also includes the URL: http://yonatanbisk.com/piqa |
| Dataset Splits | Yes | In total our dataset is comprised of over 16,000 training QA pairs with an additional 2K and 3k held out for development and testing, respectively. We follow best practices in using a grid search over learning rates, batch sizes, and the number of training epochs for each model, and report the best-scoring configuration as was found on the validation set. (A data-loading sketch for these splits appears after the table.) |
| Hardware Specification | No | Computations on beaker.org were supported in part by Google Cloud. This mentions a cloud service but does not specify any particular hardware components like GPU or CPU models. |
| Software Dependencies | No | For all models and experiments, we used the transformers library and truncated examples at 150 tokens... Spacy, average 7.8 words. The paper mentions software libraries ('transformers library' and 'Spacy') but does not provide specific version numbers for them. |
| Experiment Setup | Yes | For all models and experiments, we used the transformers library and truncated examples at 150 tokens, which affects 1% of the data. We follow best practices in using a grid search over learning rates, batch sizes, and the number of training epochs for each model, and report the best-scoring configuration as was found on the validation set. (A minimal encoding sketch illustrating the 150-token truncation appears below.) |
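
The split sizes quoted above (over 16,000 training pairs, roughly 2K development and 3K test) can be sanity-checked once the data is downloaded from the project site. The sketch below is a minimal loading routine, assuming a one-example-per-line JSONL format with a parallel label file; the file names `train.jsonl`, `train-labels.lst`, `valid.jsonl`, and `valid-labels.lst` are assumptions, not confirmed by the paper.

```python
import json

def load_piqa_split(jsonl_path, labels_path=None):
    """Load one PIQA split: each line holds a goal and two candidate solutions."""
    with open(jsonl_path) as f:
        examples = [json.loads(line) for line in f]
    labels = None
    if labels_path is not None:
        with open(labels_path) as f:
            # Each label is 0 or 1, indicating which solution is correct.
            labels = [int(line.strip()) for line in f]
    return examples, labels

# Assumed file names from the project-site download; test labels are held out.
train_examples, train_labels = load_piqa_split("train.jsonl", "train-labels.lst")
dev_examples, dev_labels = load_piqa_split("valid.jsonl", "valid-labels.lst")

# Expect >16,000 training pairs and roughly 2,000 development pairs.
print(len(train_examples), len(dev_examples))
```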
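
The experiment setup row states that all models were built with the transformers library and that examples were truncated at 150 tokens. The snippet below is a minimal sketch of how a single PIQA example (the paper's water-bottle example) could be encoded for a multiple-choice model under that truncation; the `roberta-base` checkpoint and the goal/solution pairing scheme are illustrative assumptions, not the authors' released setup.

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; the paper evaluates several pretrained models.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

example = {
    "goal": "To separate egg whites from the yolk using a water bottle, you should",
    "sol1": "Squeeze the water bottle and press it against the yolk. "
            "Release, which creates suction and lifts the yolk.",
    "sol2": "Place the water bottle and press it against the yolk. "
            "Keep pushing, which creates suction and lifts the yolk.",
}

# Pair the goal with each candidate solution; a multiple-choice head scores both
# pairs and the higher-scoring one is taken as the prediction.
encoded = tokenizer(
    [example["goal"], example["goal"]],
    [example["sol1"], example["sol2"]],
    truncation=True,      # enforce the paper's 150-token limit
    max_length=150,
    padding="max_length",
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([2, 150])
```

The grid search described in the quote would then be layered on top of such encodings: train one model per (learning rate, batch size, epoch count) combination and keep the configuration with the best validation accuracy.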