PIQA: Reasoning about Physical Commonsense in Natural Language
Authors: Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi
AAAI 2020, pp. 7432-7439 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA. Though humans find the dataset easy (95% accuracy), large pretrained models struggle (~75%). We present our results in Table 1. In this section, we test the performance of state-of-the-art natural language understanding models on our dataset, PIQA. |
| Researcher Affiliation | Collaboration | Yonatan Bisk,1,2,3,4 Rowan Zellers,1,4 Ronan Le Bras,1 Jianfeng Gao,2 Yejin Choi1,4 1Allen Institute for Artificial Intelligence 2Microsoft Research AI 3Carnegie Mellon University 4Paul G. Allen School for Computer Science and Engineering, University of Washington |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a project website (http://yonatanbisk.com/piqa) but does not explicitly state that the source code for their methodology is released or provide a direct link to a code repository for it. |
| Open Datasets | Yes | In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA. We introduce a new dataset, PIQA, for benchmarking progress in physical commonsense understanding. The paper's abstract also includes the URL: http://yonatanbisk.com/piqa |
| Dataset Splits | Yes | In total our dataset is comprised of over 16,000 training QA pairs with an additional 2K and 3k held out for development and testing, respectively. We follow best practices in using a grid search over learning rates, batch sizes, and the number of training epochs for each model, and report the best-scoring configuration as was found on the validation set. (A data-loading sketch for these splits appears after the table.) |
| Hardware Specification | No | Computations on beaker.org were supported in part by Google Cloud. This mentions a cloud service but does not specify any particular hardware components like GPU or CPU models. |
| Software Dependencies | No | For all models and experiments, we used the transformers library and truncated examples at 150 tokens... Spacy, average 7.8 words. The paper mentions software libraries ('transformers library' and 'Spacy') but does not provide specific version numbers for them. |
| Experiment Setup | Yes | For all models and experiments, we used the transformers library and truncated examples at 150 tokens, which affects 1% of the data. We follow best practices in using a grid search over learning rates, batch sizes, and the number of training epochs for each model, and report the best-scoring configuration as was found on the validation set. (A minimal encoding sketch illustrating the 150-token truncation appears below.) |
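
The split sizes quoted above (over 16,000 training pairs, roughly 2K development and 3K test) can be sanity-checked once the data is downloaded from the project site. The sketch below is a minimal loading routine, assuming a one-example-per-line JSONL format with a parallel label file; the file names `train.jsonl`, `train-labels.lst`, `valid.jsonl`, and `valid-labels.lst` are assumptions, not confirmed by the paper.

```python
import json

def load_piqa_split(jsonl_path, labels_path=None):
    """Load one PIQA split: each line holds a goal and two candidate solutions."""
    with open(jsonl_path) as f:
        examples = [json.loads(line) for line in f]
    labels = None
    if labels_path is not None:
        with open(labels_path) as f:
            # Each label is 0 or 1, indicating which solution is correct.
            labels = [int(line.strip()) for line in f]
    return examples, labels

# Assumed file names from the project-site download; test labels are held out.
train_examples, train_labels = load_piqa_split("train.jsonl", "train-labels.lst")
dev_examples, dev_labels = load_piqa_split("valid.jsonl", "valid-labels.lst")

# Expect >16,000 training pairs and roughly 2,000 development pairs.
print(len(train_examples), len(dev_examples))
```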
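
The experiment setup row states that all models were built with the transformers library and that examples were truncated at 150 tokens. The snippet below is a minimal sketch of how a single PIQA example (the paper's water-bottle example) could be encoded for a multiple-choice model under that truncation; the `roberta-base` checkpoint and the goal/solution pairing scheme are illustrative assumptions, not the authors' released setup.

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; the paper evaluates several pretrained models.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

example = {
    "goal": "To separate egg whites from the yolk using a water bottle, you should",
    "sol1": "Squeeze the water bottle and press it against the yolk. "
            "Release, which creates suction and lifts the yolk.",
    "sol2": "Place the water bottle and press it against the yolk. "
            "Keep pushing, which creates suction and lifts the yolk.",
}

# Pair the goal with each candidate solution; a multiple-choice head scores both
# pairs and the higher-scoring one is taken as the prediction.
encoded = tokenizer(
    [example["goal"], example["goal"]],
    [example["sol1"], example["sol2"]],
    truncation=True,      # enforce the paper's 150-token limit
    max_length=150,
    padding="max_length",
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([2, 150])
```

The grid search described in the quote would then be layered on top of such encodings: train one model per (learning rate, batch size, epoch count) combination and keep the configuration with the best validation accuracy.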