Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention
Authors: Yichong Xu, Chenguang Zhu, Shuohang Wang, Siqi Sun, Hao Cheng, Xiaodong Liu, Jianfeng Gao, Pengcheng He, Michael Zeng, Xuedong Huang
IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present our empirical results in this section. Each of our three external knowledge sources can boost the commonsense reasoning performance, and combining all three techniques helps us reach human parity on the CommonsenseQA benchmark. |
| Researcher Affiliation | Industry | Yichong Xu, Chenguang Zhu, Shuohang Wang, Siqi Sun, Hao Cheng, Xiaodong Liu, Jianfeng Gao, Pengcheng He, Michael Zeng and Xuedong Huang; Microsoft Corporation; {yicxu,chezhu,shuowa,siqi.sun,chehao,xiaodl,jfgao,penhe,nzeng,xdh}@microsoft.com |
| Pseudocode | No | The paper includes a diagram (Figure 1) illustrating the proposed method, but it does not contain any structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the described methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We focus on the CommonsenseQA (CSQA, Talmor et al., 2019) benchmark. CommonsenseQA is a widely used multiple-choice question answering dataset that requires commonsense knowledge. It contains 12k questions created using ConceptNet [Speer et al., 2017]. (A hedged dataset-loading sketch follows the table.) |
| Dataset Splits | Yes | We train the model for 10 epochs and take the best result on the dev set. |
| Hardware Specification | No | The paper states 'The batch size is set to 48 or smaller to fit the batch onto a single GPU,' but it does not specify the model or type of GPU, CPU, or any other detailed hardware specifications used for the experiments. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and specific model variants like the 'DeBERTa v2 model' and 'DeBERTa V3 model'. However, it does not provide specific version numbers for any software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We finetune the model using the AdamW optimizer [Loshchilov and Hutter, 2017]. The batch size is set to 48 or smaller... We train the model for 10 epochs... We choose the weight decay in {0, 0.01, 0.1}. The learning rate is chosen from {1e-5, 2e-5, 3e-6} for all encoders except for DeBERTa...chosen from {4e-6, 6e-6, 9e-6}. For VAT, we choose the weight multiplier α ∈ {0.1, 1.0, 10.0} and set the input variation norm ε = 1e-5... We set the number of retrieved questions M = 10. (A hedged configuration sketch reflecting these settings follows the table.) |
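
For orientation, here is a minimal sketch of loading the CommonsenseQA splits, assuming the Hugging Face `datasets` hub id `commonsense_qa`. The paper does not state how the data was loaded; the split and field names below are those of the hub copy, not necessarily the authors' pipeline.

```python
# Hedged sketch: load CommonsenseQA from the Hugging Face hub. The dataset id,
# split names, and field names are assumptions about the hub copy, not the
# authors' data pipeline.
from datasets import load_dataset

csqa = load_dataset("commonsense_qa")

# The paper trains on the train split and keeps the best checkpoint on dev
# (called "validation" on the hub); test labels are hidden on the leaderboard.
for split_name, split in csqa.items():
    print(f"{split_name}: {len(split)} questions")

# Each example is a five-way multiple-choice question derived from ConceptNet.
example = csqa["train"][0]
print(example["question"])
print(list(zip(example["choices"]["label"], example["choices"]["text"])))
print("gold:", example["answerKey"])
```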
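
The quoted experiment setup also translates into a compact configuration: AdamW, batch size up to 48, 10 epochs with dev-set selection, the stated weight-decay and learning-rate grids, VAT with α ∈ {0.1, 1.0, 10.0} and ε = 1e-5, and M = 10 retrieved questions. The sketch below mirrors those numbers under the assumption of a PyTorch / Transformers finetuning loop; the model name, helper functions, and grid-search structure are illustrative, not the authors' code.

```python
# Hedged sketch of the reported finetuning configuration. The constants come
# from the paper's setup; the model choice, helper names, and grid-search loop
# are assumptions about how such a setup could be wired together.
from itertools import product

import torch
from transformers import AutoModelForMultipleChoice

BATCH_SIZE = 48        # "48 or smaller to fit the batch onto a single GPU"
NUM_EPOCHS = 10        # train 10 epochs, keep the best dev-set checkpoint
VAT_EPSILON = 1e-5     # input variation norm for virtual adversarial training
NUM_RETRIEVED = 10     # M = 10 retrieved questions fed to external attention

SEARCH_SPACE = {
    "weight_decay": [0.0, 0.01, 0.1],
    # Grid reported for most encoders; DeBERTa instead uses {4e-6, 6e-6, 9e-6}.
    "learning_rate": [1e-5, 2e-5, 3e-6],
    # VAT loss weight multiplier alpha.
    "vat_alpha": [0.1, 1.0, 10.0],
}


def hyperparameter_grid():
    """Yield every combination in the reported search space."""
    keys = list(SEARCH_SPACE)
    for values in product(*(SEARCH_SPACE[k] for k in keys)):
        yield dict(zip(keys, values))


def build_optimizer(model, learning_rate, weight_decay):
    """AdamW optimizer, as stated in the experiment setup."""
    return torch.optim.AdamW(
        model.parameters(), lr=learning_rate, weight_decay=weight_decay
    )


# Example instantiation (the checkpoint name is an assumption; the paper
# reports DeBERTa v2 / v3 variants as its strongest encoders).
model = AutoModelForMultipleChoice.from_pretrained("microsoft/deberta-v3-large")
config = next(hyperparameter_grid())
optimizer = build_optimizer(model, config["learning_rate"], config["weight_decay"])
```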