Improving Commonsense Causal Reasoning by Adversarial Training and Data Augmentation
Authors: Ieva Staliūnaitė, Philip John Gorinski, Ignacio Iacobacci (pp. 13834-13842)
Venue: AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both methods boost model performance on the Choice of Plausible Alternatives (COPA) dataset, as well as on a Balanced COPA dataset, which is a modified version of the original data that has been developed to avoid superficial cues, leading to a more challenging benchmark. We show a statistically significant improvement in performance and robustness on both datasets, even with only a small number of additionally generated data points. |
| Researcher Affiliation | Industry | Ieva Staliūnaitė, Philip John Gorinski, Ignacio Iacobacci, Huawei Noah's Ark Lab, London, United Kingdom, {ieva.staliunaite \| philip.john.gorinski \| ignacio.iacobacci}@huawei.com |
| Pseudocode | No | The paper describes its methods in prose and includes a diagram of the model architecture (Figure 1), but no structured pseudocode or algorithm blocks are provided. |
| Open Source Code | No | The paper mentions using "the implementation of RoBERTa (Liu et al. 2019) provided by Huggingface" (with a footnote linking to huggingface.co), but there is no explicit statement or link for the authors' own source code for the methodology or experiments described in the paper. |
| Open Datasets | Yes | The most commonly used benchmark task for evaluation of commonsense reasoning models is the Choice of Plausible Alternatives (Roemmele, Bejan, and Gordon 2011, COPA). ... Consequently, Kavumba et al. (2019) introduce a Balanced COPA dataset by manually adjusting items from COPA to remove the superficial features... We therefore opt to use the recently published OpenWebText corpus, itself derived from a non-open dataset introduced in Radford et al. (2019). OpenWebText contains 40GB of text from over 8 million documents, spanning a plethora of resources and domains. (Footnote 5: http://Skylion007.github.io/OpenWebTextCorpus) |
| Dataset Splits | Yes | We train for a maximum of 50 epochs, stopping early when performance on the development set ceases to improve. We average the model performance on the development data over the results of the 20 seeds per learning rate, remove the bottom and top two outliers, and evaluate those models trained with the best-performing learning rate on the test set. ... Table 1 summarizes the results of model performance. The adversarially enhanced models outperform the models trained on the original data alone in terms of both average performance and standard deviation. (A hedged sketch of this seed-averaging protocol appears after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud computing specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using "RoBERTa (Liu et al. 2019) provided by Huggingface", "Penn Discourse TreeBank (Prasad et al. 2008, PDTB) parser (Lin, Ng, and Kan 2014)", "GPT-2 (Radford et al. 2019)", and "WordNet (Fellbaum 2012)" but does not specify version numbers for these software components or for underlying libraries such as PyTorch or TensorFlow. |
| Experiment Setup | Yes | Similarly to Kavumba et al. (2019), we use the learning rates of 1e-6, 2e-6 and 3e-6, 20 different seeds per learning rate, weight decay of 0.01, and a batch size of 32. We train for a maximum of 50 epochs, stopping early when performance on the development set ceases to improve. (A hedged configuration sketch of this setup follows the table.) |
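
Since the authors do not release code, the setup can only be approximated. The following is a minimal sketch of one fine-tuning run matching the hyperparameters quoted under "Experiment Setup", assuming the Huggingface Transformers `Trainer` API with `RobertaForMultipleChoice`; the checkpoint name (`roberta-large`), the early-stopping patience, and the `train_ds`/`dev_ds` dataset objects are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: one fine-tuning run with the hyperparameters reported in the
# paper (weight decay 0.01, batch size 32, up to 50 epochs, early stopping on
# dev performance). Learning rate and seed vary across the grid described above.
# NOTE: checkpoint name, patience value, and dataset objects are assumptions.
import numpy as np
from transformers import (
    EarlyStoppingCallback,
    RobertaForMultipleChoice,
    Trainer,
    TrainingArguments,
)

def compute_accuracy(eval_pred):
    """Dev-set accuracy used for model selection and early stopping."""
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

def build_trainer(train_ds, dev_ds, learning_rate, seed):
    """One COPA fine-tuning run for a given learning rate and random seed."""
    model = RobertaForMultipleChoice.from_pretrained("roberta-large")  # assumed checkpoint
    args = TrainingArguments(
        output_dir=f"copa_lr{learning_rate}_seed{seed}",
        learning_rate=learning_rate,          # one of 1e-6, 2e-6, 3e-6
        weight_decay=0.01,
        per_device_train_batch_size=32,
        num_train_epochs=50,                  # maximum; early stopping may end sooner
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        greater_is_better=True,
        seed=seed,
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=dev_ds,
        compute_metrics=compute_accuracy,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience assumed
    )
```

The paper runs 20 seeds for each of the three learning rates, so reproducing the grid would mean one `build_trainer(...)` call per (learning rate, seed) pair, i.e. 60 runs in total.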
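The model-selection protocol quoted under "Dataset Splits" (average dev performance over 20 seeds per learning rate, drop the two lowest and two highest scores, take the best learning rate to the test set) can be summarised in a few lines. This is a sketch of the trimmed-mean selection step only; the example scores are random placeholders, not results from the paper.

```python
# Hedged sketch of the seed-averaging / outlier-trimming protocol: per learning
# rate, dev accuracies from 20 seeds are sorted, the bottom two and top two are
# dropped, and the remaining 16 are averaged; the learning rate with the best
# trimmed mean is then evaluated on the test set.
import numpy as np

def trimmed_mean(scores, trim=2):
    """Mean after discarding the `trim` lowest and `trim` highest values."""
    ordered = np.sort(np.asarray(scores, dtype=float))
    return float(ordered[trim:len(ordered) - trim].mean())

def select_learning_rate(dev_scores):
    """Pick the learning rate whose 20-seed trimmed-mean dev accuracy is highest."""
    return max(dev_scores, key=lambda lr: trimmed_mean(dev_scores[lr]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder dev accuracies; real values would come from the 60 fine-tuning runs.
    dev_scores = {lr: rng.uniform(0.70, 0.90, size=20) for lr in (1e-6, 2e-6, 3e-6)}
    best_lr = select_learning_rate(dev_scores)
    print(f"best learning rate: {best_lr:g}, "
          f"trimmed dev accuracy: {trimmed_mean(dev_scores[best_lr]):.3f}")
```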