Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention
Authors: Yichong Xu, Chenguang Zhu, Shuohang Wang, Siqi Sun, Hao Cheng, Xiaodong Liu, Jianfeng Gao, Pengcheng He, Michael Zeng, Xuedong Huang
IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present our empirical results in this section. Each of our three external knowledge sources can boost the commonsense reasoning performance, and combining all three techniques helps us reach human parity on the CommonsenseQA benchmark. |
| Researcher Affiliation | Industry | Yichong Xu, Chenguang Zhu, Shuohang Wang, Siqi Sun, Hao Cheng, Xiaodong Liu, Jianfeng Gao, Pengcheng He, Michael Zeng and Xuedong Huang; Microsoft Corporation; {yicxu,chezhu,shuowa,siqi.sun,chehao,xiaodl,jfgao,penhe,nzeng,xdh}@microsoft.com |
| Pseudocode | No | The paper includes a diagram (Figure 1) illustrating the proposed method, but it does not contain any structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the described methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We focus on the CommonsenseQA (CSQA, Talmor et al., 2019) benchmark. CommonsenseQA is a widely used multiple-choice question answering dataset that requires commonsense knowledge. It contains 12k questions created using ConceptNet [Speer et al., 2017]. (A hedged dataset-loading sketch follows the table.) |
| Dataset Splits | Yes | We train the model for 10 epochs and take the best result on the dev set. |
| Hardware Specification | No | The paper states 'The batch size is set to 48 or smaller to fit the batch onto a single GPU,' but it does not specify the model or type of GPU, CPU, or any other detailed hardware specifications used for the experiments. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and specific model variants like the 'DeBERTa v2 model' and 'DeBERTa V3 model'. However, it does not provide specific version numbers for any software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We finetune the model using the AdamW optimizer [Loshchilov and Hutter, 2017]. The batch size is set to 48 or smaller... We train the model for 10 epochs... We choose the weight decay in {0, 0.01, 0.1}. The learning rate is chosen from {1e-5, 2e-5, 3e-6} for all encoders except for DeBERTa...chosen from {4e-6, 6e-6, 9e-6}. For VAT, we choose the weight multiplier α ∈ {0.1, 1.0, 10.0} and set the input variation norm ε = 1e-5... We set the number of retrieved questions M = 10. (A hedged configuration sketch reflecting these settings follows the table.) |
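
For orientation, here is a minimal sketch of loading the CommonsenseQA splits, assuming the Hugging Face `datasets` hub id `commonsense_qa`. The paper does not state how the data was loaded; the split and field names below are those of the hub copy, not necessarily the authors' pipeline.

```python
# Hedged sketch: load CommonsenseQA from the Hugging Face hub. The dataset id,
# split names, and field names are assumptions about the hub copy, not the
# authors' data pipeline.
from datasets import load_dataset

csqa = load_dataset("commonsense_qa")

# The paper trains on the train split and keeps the best checkpoint on dev
# (called "validation" on the hub); test labels are hidden on the leaderboard.
for split_name, split in csqa.items():
    print(f"{split_name}: {len(split)} questions")

# Each example is a five-way multiple-choice question derived from ConceptNet.
example = csqa["train"][0]
print(example["question"])
print(list(zip(example["choices"]["label"], example["choices"]["text"])))
print("gold:", example["answerKey"])
```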
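
The quoted experiment setup also translates into a compact configuration: AdamW, batch size up to 48, 10 epochs with dev-set selection, the stated weight-decay and learning-rate grids, VAT with α ∈ {0.1, 1.0, 10.0} and ε = 1e-5, and M = 10 retrieved questions. The sketch below mirrors those numbers under the assumption of a PyTorch / Transformers finetuning loop; the model name, helper functions, and grid-search structure are illustrative, not the authors' code.

```python
# Hedged sketch of the reported finetuning configuration. The constants come
# from the paper's setup; the model choice, helper names, and grid-search loop
# are assumptions about how such a setup could be wired together.
from itertools import product

import torch
from transformers import AutoModelForMultipleChoice

BATCH_SIZE = 48        # "48 or smaller to fit the batch onto a single GPU"
NUM_EPOCHS = 10        # train 10 epochs, keep the best dev-set checkpoint
VAT_EPSILON = 1e-5     # input variation norm for virtual adversarial training
NUM_RETRIEVED = 10     # M = 10 retrieved questions fed to external attention

SEARCH_SPACE = {
    "weight_decay": [0.0, 0.01, 0.1],
    # Grid reported for most encoders; DeBERTa instead uses {4e-6, 6e-6, 9e-6}.
    "learning_rate": [1e-5, 2e-5, 3e-6],
    # VAT loss weight multiplier alpha.
    "vat_alpha": [0.1, 1.0, 10.0],
}


def hyperparameter_grid():
    """Yield every combination in the reported search space."""
    keys = list(SEARCH_SPACE)
    for values in product(*(SEARCH_SPACE[k] for k in keys)):
        yield dict(zip(keys, values))


def build_optimizer(model, learning_rate, weight_decay):
    """AdamW optimizer, as stated in the experiment setup."""
    return torch.optim.AdamW(
        model.parameters(), lr=learning_rate, weight_decay=weight_decay
    )


# Example instantiation (the checkpoint name is an assumption; the paper
# reports DeBERTa v2 / v3 variants as its strongest encoders).
model = AutoModelForMultipleChoice.from_pretrained("microsoft/deberta-v3-large")
config = next(hyperparameter_grid())
optimizer = build_optimizer(model, config["learning_rate"], config["weight_decay"])
```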