Zero-Shot Commonsense Question Answering with Cloze Translation and Consistency Optimization

Authors: Zi-Yi Dou, Nanyun Peng
Pages: 10572-10580

AAAI 2022

Each entry below gives the reproducibility variable, the assessed result, and the supporting LLM response:
Research Type: Experimental. "We demonstrate the effectiveness of our methods on three CQA datasets in zero-shot settings. We show that our methods are complementary to a knowledge base improved model, and combining them can lead to state-of-the-art zero-shot performance."
Researcher Affiliation: Academia. "Zi-Yi Dou, Nanyun Peng, University of California, Los Angeles, {zdou,violetpeng}@cs.ucla.edu"
Pseudocode: Yes. "Algorithm 1: Our syntactic-based rewriting method (SQ is defined as the subconstituent of questions excluding wh-word or wh-phrase)" (illustrative rewriting sketch below).
Open Source Code: Yes. "Code/dataset is available at https://github.com/PlusLabNLP/zero_shot_cqa."
Open Datasets: Yes. "We experiment on three CQA datasets, including Commonsense QA (Talmor et al. 2019), Openbook QA (Mihaylov et al. 2018), and Social IQA (Sap et al. 2019b)." (dataset-loading sketch below).
Dataset Splits: Yes. "For the Commonsense QA dataset (Talmor et al. 2019), because its test set is not publicly available, the predictions for it can only be evaluated once every two weeks via the official leaderboard. Therefore, following previous work (Lin et al. 2019; Wang et al. 2020), we separate the training data into training and test sets consisting of 8,500 and 1,241 instances respectively. We use the standard development set consisting of 1,221 instances. The Openbook QA (Mihaylov et al. 2018) dataset consists of 5,957 multiple-choice questions with 4,957 training, 500 development, 500 testing instances. While it provides a small book of 1,326 core science facts, we do not include this additional information because our focus is on the implicitly learned knowledge in pre-trained language models. The Social IQA (Sap et al. 2019b) dataset contains 33,410 training, 1,954 development, 2,224 testing instances, the aim of which is to probe the emotional and social intelligence of models in a variety of everyday situations." (re-split sketch below).
Hardware Specification: No. The paper mentions the language models used (e.g., ALBERT-xxlarge-v2, BART-Large, RoBERTa) but does not specify any hardware details such as GPU models, CPU types, or memory specifications used for experiments.
Software Dependencies: No. The paper names various models and tools (e.g., BART-Large, GECToR, ALBERT-xxlarge-v2) but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup: Yes. "For the seq2seq model, we follow the setting in text summarization on XSUM and finetune the BART-Large model on the training set of our cloze data for 15k steps with a batch size of 16,384 tokens and a learning rate of 3e-5." (configuration sketch below).
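
The Pseudocode entry refers to the paper's Algorithm 1, a syntactic-based rewriting of natural questions into cloze form. The sketch below is only a rough illustration of that idea under simplifying assumptions: it uses regular expressions instead of a constituency parse, and the function name `question_to_cloze` and the `[MASK]` placeholder are choices made here, not taken from the paper.

```python
# Rough illustration of wh-question -> cloze rewriting in the spirit of Algorithm 1.
# The actual method operates on a constituency parse and keeps SQ, the
# sub-constituent of the question excluding the wh-word/wh-phrase; the regex
# heuristics below are a simplification for illustration only.
import re

WH_WORDS = r"(what|which|who|whom|whose|where|when|why|how)"

def question_to_cloze(question: str, mask: str = "[MASK]") -> str:
    """Drop the wh-word/wh-phrase, keep the remaining material (a stand-in for SQ),
    and insert a mask slot to obtain a declarative cloze statement."""
    q = question.strip().rstrip("?")
    # Pattern "What/Which/Who is X" -> "X is [MASK]."
    m = re.match(r"^(what|which|who)\s+(is|are|was|were)\s+(.+)$", q, re.IGNORECASE)
    if m:
        return f"{m.group(3)} {m.group(2).lower()} {mask}."
    # Fallback: strip the wh-word (plus an optional auxiliary) and append the mask.
    sq = re.sub(rf"^{WH_WORDS}\s+(do|does|did|can|could|would|should)?\s*",
                "", q, flags=re.IGNORECASE)
    return f"{sq.strip()} {mask}."

if __name__ == "__main__":
    print(question_to_cloze("What is the capital of France?"))
    # -> "the capital of France is [MASK]."
    print(question_to_cloze("Where do people usually keep a wallet?"))
    # -> "people usually keep a wallet [MASK]."
```

In the zero-shot setting described in the title, a rewritten cloze like this can then be scored by a pre-trained language model with each answer candidate filled into the mask slot; the sketch stops at the rewriting step.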
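
For the Open Datasets entry, the three benchmarks are publicly available; a minimal loading sketch follows. The Hugging Face Hub identifiers (`commonsense_qa`, `openbookqa`, `social_i_qa`) are assumptions about where the datasets are hosted today and do not come from the paper.

```python
# Minimal sketch: load the three CQA benchmarks from the Hugging Face Hub.
# Hub identifiers are assumed, not taken from the paper.
from datasets import load_dataset

commonsense_qa = load_dataset("commonsense_qa")    # Talmor et al. 2019
openbook_qa = load_dataset("openbookqa", "main")   # Mihaylov et al. 2018
social_iqa = load_dataset("social_i_qa")           # Sap et al. 2019b

for name, ds in [("CommonsenseQA", commonsense_qa),
                 ("OpenBookQA", openbook_qa),
                 ("SocialIQA", social_iqa)]:
    print(name, {split: len(ds[split]) for split in ds})
```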
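
The Dataset Splits entry notes that the CommonsenseQA training data (8,500 + 1,241 = 9,741 questions) is re-split because the official test set is hidden. The sketch below reproduces that arithmetic; the random seed and the exact selection procedure are assumptions, since the excerpt does not specify them.

```python
# Sketch of the CommonsenseQA re-split described above: carve a held-out test set
# of 1,241 questions out of the official training data, keeping 8,500 for training.
# The shuffling seed and exact selection are assumptions for illustration only.
import random

def resplit_commonsenseqa(official_train, test_size=1241, seed=42):
    """Return (new_train, new_test) carved out of the official training examples."""
    examples = list(official_train)
    random.Random(seed).shuffle(examples)
    return examples[test_size:], examples[:test_size]

# Example usage (with the dataset loaded as in the previous sketch):
# new_train, new_test = resplit_commonsenseqa(commonsense_qa["train"])
# assert len(new_train) == 8500 and len(new_test) == 1241
```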
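
The Experiment Setup excerpt gives concrete hyperparameters for the question-to-cloze seq2seq model: BART-Large, 15k updates, a learning rate of 3e-5, and batches of 16,384 tokens following the XSUM summarization recipe. The sketch below maps those numbers onto a Hugging Face `Seq2SeqTrainer` configuration; the token-based batch size is approximated by sequence-based batching, and the output directory and dataset variable are placeholders, so this is an approximation rather than the authors' exact setup.

```python
# Approximate training configuration for fine-tuning BART-Large on
# (natural question -> cloze) pairs, using the hyperparameters quoted above.
# Token-based batching (16,384 tokens) is approximated here by
# 32 sequences/device x 4 accumulation steps x <=128 tokens per sequence.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

args = Seq2SeqTrainingArguments(
    output_dir="bart_question_to_cloze",   # placeholder path
    max_steps=15_000,                      # 15k updates, as reported
    learning_rate=3e-5,                    # as reported
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    predict_with_generate=True,
)

# `cloze_train_set` is a placeholder for the tokenized (question, cloze) pairs.
# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=cloze_train_set, tokenizer=tokenizer)
# trainer.train()
```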