Zero-Shot Commonsense Question Answering with Cloze Translation and Consistency Optimization

Authors: Zi-Yi Dou, Nanyun Peng
Pages: 10572-10580

AAAI 2022

Each entry below gives the reproducibility variable, the assessed result, and the supporting LLM response:
Research Type: Experimental. "We demonstrate the effectiveness of our methods on three CQA datasets in zero-shot settings. We show that our methods are complementary to a knowledge base improved model, and combining them can lead to state-of-the-art zero-shot performance."
Researcher Affiliation: Academia. "Zi-Yi Dou, Nanyun Peng, University of California, Los Angeles, {zdou,violetpeng}@cs.ucla.edu"
Pseudocode: Yes. "Algorithm 1: Our syntactic-based rewriting method (SQ is defined as the subconstituent of questions excluding wh-word or wh-phrase)" (illustrative rewriting sketch below).
Open Source Code: Yes. "Code/dataset is available at https://github.com/PlusLabNLP/zero_shot_cqa."
Open Datasets: Yes. "We experiment on three CQA datasets, including Commonsense QA (Talmor et al. 2019), Openbook QA (Mihaylov et al. 2018), and Social IQA (Sap et al. 2019b)." (dataset-loading sketch below).
Dataset Splits: Yes. "For the Commonsense QA dataset (Talmor et al. 2019), because its test set is not publicly available, the predictions for it can only be evaluated once every two weeks via the official leaderboard. Therefore, following previous work (Lin et al. 2019; Wang et al. 2020), we separate the training data into training and test sets consisting of 8,500 and 1,241 instances respectively. We use the standard development set consisting of 1,221 instances. The Openbook QA (Mihaylov et al. 2018) dataset consists of 5,957 multiple-choice questions with 4,957 training, 500 development, 500 testing instances. While it provides a small book of 1,326 core science facts, we do not include this additional information because our focus is on the implicitly learned knowledge in pre-trained language models. The Social IQA (Sap et al. 2019b) dataset contains 33,410 training, 1,954 development, 2,224 testing instances, the aim of which is to probe the emotional and social intelligence of models in a variety of everyday situations." (re-split sketch below).
Hardware Specification: No. The paper mentions the language models used (e.g., ALBERT-xxlarge-v2, BART-Large, RoBERTa) but does not specify any hardware details such as GPU models, CPU types, or memory specifications used for experiments.
Software Dependencies: No. The paper names various models and tools (e.g., BART-Large, GECToR, ALBERT-xxlarge-v2) but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup: Yes. "For the seq2seq model, we follow the setting in text summarization on XSUM and finetune the BART-Large model on the training set of our cloze data for 15k steps with a batch size of 16,384 tokens and a learning rate of 3e-5." (configuration sketch below).
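
The Pseudocode entry refers to the paper's Algorithm 1, a syntactic-based rewriting of natural questions into cloze form. The sketch below is only a rough illustration of that idea under simplifying assumptions: it uses regular expressions instead of a constituency parse, and the function name `question_to_cloze` and the `[MASK]` placeholder are choices made here, not taken from the paper.

```python
# Rough illustration of wh-question -> cloze rewriting in the spirit of Algorithm 1.
# The actual method operates on a constituency parse and keeps SQ, the
# sub-constituent of the question excluding the wh-word/wh-phrase; the regex
# heuristics below are a simplification for illustration only.
import re

WH_WORDS = r"(what|which|who|whom|whose|where|when|why|how)"

def question_to_cloze(question: str, mask: str = "[MASK]") -> str:
    """Drop the wh-word/wh-phrase, keep the remaining material (a stand-in for SQ),
    and insert a mask slot to obtain a declarative cloze statement."""
    q = question.strip().rstrip("?")
    # Pattern "What/Which/Who is X" -> "X is [MASK]."
    m = re.match(r"^(what|which|who)\s+(is|are|was|were)\s+(.+)$", q, re.IGNORECASE)
    if m:
        return f"{m.group(3)} {m.group(2).lower()} {mask}."
    # Fallback: strip the wh-word (plus an optional auxiliary) and append the mask.
    sq = re.sub(rf"^{WH_WORDS}\s+(do|does|did|can|could|would|should)?\s*",
                "", q, flags=re.IGNORECASE)
    return f"{sq.strip()} {mask}."

if __name__ == "__main__":
    print(question_to_cloze("What is the capital of France?"))
    # -> "the capital of France is [MASK]."
    print(question_to_cloze("Where do people usually keep a wallet?"))
    # -> "people usually keep a wallet [MASK]."
```

In the zero-shot setting described in the title, a rewritten cloze like this can then be scored by a pre-trained language model with each answer candidate filled into the mask slot; the sketch stops at the rewriting step.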
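
For the Open Datasets entry, the three benchmarks are publicly available; a minimal loading sketch follows. The Hugging Face Hub identifiers (`commonsense_qa`, `openbookqa`, `social_i_qa`) are assumptions about where the datasets are hosted today and do not come from the paper.

```python
# Minimal sketch: load the three CQA benchmarks from the Hugging Face Hub.
# Hub identifiers are assumed, not taken from the paper.
from datasets import load_dataset

commonsense_qa = load_dataset("commonsense_qa")    # Talmor et al. 2019
openbook_qa = load_dataset("openbookqa", "main")   # Mihaylov et al. 2018
social_iqa = load_dataset("social_i_qa")           # Sap et al. 2019b

for name, ds in [("CommonsenseQA", commonsense_qa),
                 ("OpenBookQA", openbook_qa),
                 ("SocialIQA", social_iqa)]:
    print(name, {split: len(ds[split]) for split in ds})
```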
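
The Dataset Splits entry notes that the CommonsenseQA training data (8,500 + 1,241 = 9,741 questions) is re-split because the official test set is hidden. The sketch below reproduces that arithmetic; the random seed and the exact selection procedure are assumptions, since the excerpt does not specify them.

```python
# Sketch of the CommonsenseQA re-split described above: carve a held-out test set
# of 1,241 questions out of the official training data, keeping 8,500 for training.
# The shuffling seed and exact selection are assumptions for illustration only.
import random

def resplit_commonsenseqa(official_train, test_size=1241, seed=42):
    """Return (new_train, new_test) carved out of the official training examples."""
    examples = list(official_train)
    random.Random(seed).shuffle(examples)
    return examples[test_size:], examples[:test_size]

# Example usage (with the dataset loaded as in the previous sketch):
# new_train, new_test = resplit_commonsenseqa(commonsense_qa["train"])
# assert len(new_train) == 8500 and len(new_test) == 1241
```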
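
The Experiment Setup excerpt gives concrete hyperparameters for the question-to-cloze seq2seq model: BART-Large, 15k updates, a learning rate of 3e-5, and batches of 16,384 tokens following the XSUM summarization recipe. The sketch below maps those numbers onto a Hugging Face `Seq2SeqTrainer` configuration; the token-based batch size is approximated by sequence-based batching, and the output directory and dataset variable are placeholders, so this is an approximation rather than the authors' exact setup.

```python
# Approximate training configuration for fine-tuning BART-Large on
# (natural question -> cloze) pairs, using the hyperparameters quoted above.
# Token-based batching (16,384 tokens) is approximated here by
# 32 sequences/device x 4 accumulation steps x <=128 tokens per sequence.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

args = Seq2SeqTrainingArguments(
    output_dir="bart_question_to_cloze",   # placeholder path
    max_steps=15_000,                      # 15k updates, as reported
    learning_rate=3e-5,                    # as reported
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    predict_with_generate=True,
)

# `cloze_train_set` is a placeholder for the tokenized (question, cloze) pairs.
# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=cloze_train_set, tokenizer=tokenizer)
# trainer.train()
```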