Evaluating Commonsense in Pre-Trained Language Models

Authors: Xuhui Zhou, Yue Zhang, Leyang Cui, Dandan Huang

AAAI 2020, pp. 9733-9740

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the commonsense ability of GPT, BERT, XLNet, and RoBERTa by testing them on seven challenging benchmarks, finding that language modeling and its variants are effective objectives for promoting models' commonsense ability, while bi-directional context and a larger training set are bonuses. We additionally find that current models do poorly on tasks requiring more necessary inference steps. Finally, we test the robustness of models by making dual test cases, which are correlated so that the correct prediction of one sample should lead to correct prediction of the other.
Researcher Affiliation | Academia | Xuhui Zhou (University of Washington), Yue Zhang (School of Engineering, Westlake University), Leyang Cui (Westlake University; Zhejiang University), Dandan Huang (Westlake University)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We publicly release our datasets, named commonsense ability tests (CATs), and the test script at GitHub [https://github.com/XuhuiZhou/CATS].
Open Datasets | Yes | We synthesize six challenging tasks by taking positive and negative samples from existing benchmarks, and further introduce a new task called Conjunction Acceptability (CA). We integrated the above test sets into a commonsense ability tests (CATs) benchmark, released for future research.
Dataset Splits | No | The paper does not specify training, validation, or test splits for the evaluation tasks themselves; it only describes how positive and negative samples are created for testing.
Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., GPU/CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper mentions using 'off-the-shelf embeddings' from various models (GPT, BERT, XLNet, RoBERTa) and references 'huggingface/transformers', but does not specify exact version numbers for software dependencies.
Experiment Setup | Yes | We derive the score of a sentence with uni-directional-context LMs and bi-directional-context LMs, respectively. Formally, suppose the sentence S consists of n words, S = {w_1, ..., w_{k-1}, w_k, w_{k+1}, ..., w_n}. We define the score of a sentence as Score(S) = (1/n) Σ_{k=1}^{n} log P_θ(w_k | context_k), where the denominator n alleviates the influence of sentence length on the model's prediction, especially in sentence-level tasks. For a uni-directional model, context_k = S_{<k} = {w_1, ..., w_{k-1}}. For a bi-directional model, context_k = S_{∖k}, i.e., S with the k-th word removed.
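
To make the scoring objective concrete, here is a minimal sketch of both variants using the Hugging Face transformers library referenced above. It is an illustration under assumptions, not the authors' released test script: the model names (gpt2, bert-base-uncased), the sub-word-level treatment of tokens, and the handling of special tokens are choices made here for simplicity.

```python
import torch
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoTokenizer


def score_unidirectional(sentence: str, model_name: str = "gpt2") -> float:
    """Average log-probability of each token given its left context (uni-directional LM)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the model returns the mean negative log-likelihood
        # over the predicted tokens (inputs are shifted internally).
        loss = model(ids, labels=ids).loss
    return -loss.item()


def score_bidirectional(sentence: str, model_name: str = "bert-base-uncased") -> float:
    """Average pseudo-log-likelihood: mask each token in turn and score it from both sides."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    total, count = 0.0, 0
    with torch.no_grad():
        for k in range(1, ids.size(0) - 1):  # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[k] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, k]
            log_probs = torch.log_softmax(logits, dim=-1)
            total += log_probs[ids[k]].item()
            count += 1
    return total / count


if __name__ == "__main__":
    s = "The trophy would not fit in the brown suitcase because it was too big."
    print("uni-directional score:", score_unidirectional(s))
    print("bi-directional score:", score_bidirectional(s))
```

In a CATs-style test, a candidate sentence pair would be scored with one of these functions and the higher-scoring sentence taken as the model's prediction; the per-token average plays the role of the 1/n normalization in the formula above.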