Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention

Authors: Yichong Xu, Chenguang Zhu, Shuohang Wang, Siqi Sun, Hao Cheng, Xiaodong Liu, Jianfeng Gao, Pengcheng He, Michael Zeng, Xuedong Huang

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present our empirical results in this section. Each of our three external knowledge sources can boost the commonsense reasoning performance, and combining all the three techniques helps us reach the human parity on the CommonsenseQA benchmark.
Researcher Affiliation | Industry | Yichong Xu, Chenguang Zhu, Shuohang Wang, Siqi Sun, Hao Cheng, Xiaodong Liu, Jianfeng Gao, Pengcheng He, Michael Zeng and Xuedong Huang, Microsoft Corporation. {yicxu,chezhu,shuowa,siqi.sun,chehao,xiaodl,jfgao,penhe,nzeng,xdh}@microsoft.com
Pseudocode | No | The paper includes a diagram (Figure 1) illustrating the proposed method, but it does not contain any structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the described methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | We focus on the CommonsenseQA (CSQA, Talmor et al., 2019) benchmark. CommonsenseQA is a widely used multiple-choice question answering dataset that requires commonsense knowledge. It contains 12k questions created using ConceptNet [Speer et al., 2017]. (See the dataset-loading sketch after this table.)
Dataset Splits | Yes | We train the model for 10 epochs and take the best result on the dev set.
Hardware Specification | No | The paper states 'The batch size is set to 48 or smaller to fit the batch onto a single GPU,' but it does not specify the model or type of GPU, CPU, or any other detailed hardware specifications used for the experiments.
Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and specific model variants like the 'DeBERTa v2 model' and 'DeBERTa v3 model'. However, it does not provide specific version numbers for any software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We finetune the model using the AdamW optimizer [Loshchilov and Hutter, 2017]. The batch size is set to 48 or smaller... We train the model for 10 epochs... We choose the weight decay in {0, 0.01, 0.1}. The learning rates are chosen from {1e-5, 2e-5, 3e-6} for all encoders except for DeBERTa... chosen from {4e-6, 6e-6, 9e-6}. For VAT, we choose the weight multiplier α ∈ {0.1, 1.0, 10.0} and set the input variation norm ε = 1e-5... We set the number of retrieved questions M = 10. (See the fine-tuning configuration sketch after this table.)
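
Dataset-loading sketch (referenced from the Open Datasets row). The paper only names the CSQA benchmark and does not say how the data was obtained. The snippet below is a minimal sketch assuming the Hugging Face `datasets` library and its public `tau/commonsense_qa` release; the dataset id and field names are assumptions about that Hub copy, not details taken from the paper.

```python
# Minimal sketch: load CommonsenseQA (CSQA) from the Hugging Face Hub.
# Assumption: the public "tau/commonsense_qa" release with train/validation/test
# splits and the schema below; the paper does not specify its data pipeline.
from datasets import load_dataset

csqa = load_dataset("tau/commonsense_qa")
print({split: len(ds) for split, ds in csqa.items()})  # roughly 12k questions in total

example = csqa["train"][0]
question = example["question"]            # question text
choices = example["choices"]["text"]      # five answer candidates
labels = example["choices"]["label"]      # "A" .. "E"
answer = example["answerKey"]             # gold label (empty string on the test split)

# One (question, candidate) input per answer choice, as is typical for
# multiple-choice fine-tuning; the exact input template used by the authors
# is not reported.
pairs = [f"{question} [SEP] {choice}" for choice in choices]
```

The dev-set model selection quoted in the Dataset Splits row would correspond to the validation split of this release.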
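
Fine-tuning configuration sketch (referenced from the Experiment Setup row). Since no code is released, the following restates the reported search space as plain Python plus a torch AdamW optimizer; the `make_optimizer` helper, the commented `train(...)` call, and the grid-enumeration loop are illustrative placeholders, not the authors' implementation.

```python
# Hedged sketch of the reported fine-tuning setup. Only the numbers quoted in the
# Experiment Setup row are taken from the paper; everything else is scaffolding.
import itertools
import torch

SEARCH_SPACE = {
    "weight_decay": [0.0, 0.01, 0.1],
    "learning_rate": [1e-5, 2e-5, 3e-6],   # {4e-6, 6e-6, 9e-6} for the DeBERTa encoders
    "vat_alpha": [0.1, 1.0, 10.0],          # weight multiplier for the VAT loss term
}
FIXED = {
    "batch_size": 48,                       # "48 or smaller" to fit onto a single GPU
    "epochs": 10,                           # best checkpoint selected on the dev set
    "vat_epsilon": 1e-5,                    # input variation norm for VAT
    "num_retrieved_questions": 10,          # M = 10
}

def make_optimizer(model: torch.nn.Module, lr: float, weight_decay: float) -> torch.optim.AdamW:
    """AdamW, as reported in the paper [Loshchilov and Hutter, 2017]."""
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

# Enumerate the grid; each configuration would be trained for FIXED["epochs"] epochs
# and the best dev-set result kept.
for wd, lr, alpha in itertools.product(*SEARCH_SPACE.values()):
    config = {**FIXED, "weight_decay": wd, "learning_rate": lr, "vat_alpha": alpha}
    # optimizer = make_optimizer(model, lr, wd)   # `model` would be the chosen encoder
    # train(model, optimizer, config)             # training loop not shown
```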