Evaluating Commonsense in Pre-Trained Language Models

Authors: Xuhui Zhou, Yue Zhang, Leyang Cui, Dandan Huang

AAAI 2020, pp. 9733-9740

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the commonsense ability of GPT, BERT, XLNet, and RoBERTa by testing them on seven challenging benchmarks, finding that language modeling and its variants are effective objectives for promoting models' commonsense ability, while bi-directional context and a larger training set are bonuses. We additionally find that current models do poorly on tasks requiring more necessary inference steps. Finally, we test the robustness of models by making dual test cases, which are correlated so that the correct prediction of one sample should lead to correct prediction of the other.
Researcher Affiliation | Academia | Xuhui Zhou (University of Washington), Yue Zhang (School of Engineering, Westlake University), Leyang Cui (Westlake University; Zhejiang University), Dandan Huang (Westlake University)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We publicly release our datasets, named commonsense ability tests (CATs), and the test script at GitHub [https://github.com/XuhuiZhou/CATS].
Open Datasets | Yes | We synthesize six challenging tasks by taking positive and negative samples from existing benchmarks, and further introduce a new task called Conjunction Acceptability (CA). We integrated the above test sets into a commonsense ability tests (CATs) benchmark, released for future research.
Dataset Splits | No | The paper does not specify training, validation, or test splits for the evaluation tasks themselves; it only describes how positive and negative samples are created for testing.
Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., GPU/CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper mentions using 'off-the-shelf embeddings' from various models (GPT, BERT, XLNet, RoBERTa) and references 'huggingface/transformers', but does not specify exact version numbers for software dependencies.
Experiment Setup | Yes | We derive the score of a sentence with uni-directional-context LMs and bi-directional-context LMs, respectively. Formally, suppose the sentence S consists of n words, S = {w_1, ..., w_{k-1}, w_k, w_{k+1}, ..., w_n}. We define the score of a sentence as Score(S) = (1/n) Σ_{k=1}^{n} log P_θ(w_k | context_k), where the denominator n alleviates the influence of sentence length on the model's prediction, especially in sentence-level tasks. For a uni-directional model, context_k = S_{<k} = {w_1, ..., w_{k-1}}. For a bi-directional model, context_k = S_{∖k}, i.e., S with the k-th word removed.
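
To make the scoring objective concrete, here is a minimal sketch of both variants using the Hugging Face transformers library referenced above. It is an illustration under assumptions, not the authors' released test script: the model names (gpt2, bert-base-uncased), the sub-word-level treatment of tokens, and the handling of special tokens are choices made here for simplicity.

```python
import torch
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoTokenizer


def score_unidirectional(sentence: str, model_name: str = "gpt2") -> float:
    """Average log-probability of each token given its left context (uni-directional LM)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the model returns the mean negative log-likelihood
        # over the predicted tokens (inputs are shifted internally).
        loss = model(ids, labels=ids).loss
    return -loss.item()


def score_bidirectional(sentence: str, model_name: str = "bert-base-uncased") -> float:
    """Average pseudo-log-likelihood: mask each token in turn and score it from both sides."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    total, count = 0.0, 0
    with torch.no_grad():
        for k in range(1, ids.size(0) - 1):  # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[k] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, k]
            log_probs = torch.log_softmax(logits, dim=-1)
            total += log_probs[ids[k]].item()
            count += 1
    return total / count


if __name__ == "__main__":
    s = "The trophy would not fit in the brown suitcase because it was too big."
    print("uni-directional score:", score_unidirectional(s))
    print("bi-directional score:", score_bidirectional(s))
```

In a CATs-style test, a candidate sentence pair would be scored with one of these functions and the higher-scoring sentence taken as the model's prediction; the per-token average plays the role of the 1/n normalization in the formula above.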