Improving Commonsense Causal Reasoning by Adversarial Training and Data Augmentation
Authors: Ieva Staliūnaitė, Philip John Gorinski, Ignacio Iacobacci (pp. 13834-13842)
Venue: AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both methods boost model performance on the Choice of Plausible Alternatives (COPA) dataset, as well as on a Balanced COPA dataset, which is a modified version of the original data that has been developed to avoid superficial cues, leading to a more challenging benchmark. We show a statistically significant improvement in performance and robustness on both datasets, even with only a small number of additionally generated data points. |
| Researcher Affiliation | Industry | Ieva Staliūnaitė, Philip John Gorinski, Ignacio Iacobacci, Huawei Noah's Ark Lab, London, United Kingdom, {ieva.staliunaite \| philip.john.gorinski \| ignacio.iacobacci}@huawei.com |
| Pseudocode | No | The paper describes its methods in prose and includes a diagram of the model architecture (Figure 1), but no structured pseudocode or algorithm blocks are provided. |
| Open Source Code | No | The paper mentions using "the implementation of RoBERTa (Liu et al. 2019) provided by Huggingface" (with a footnote linking to huggingface.co), but there is no explicit statement or link for the authors' own source code for the methodology or experiments described in the paper. |
| Open Datasets | Yes | The most commonly used benchmark task for evaluation of commonsense reasoning models is the Choice of Plausible Alternatives (Roemmele, Bejan, and Gordon 2011, COPA). ... Consequently, Kavumba et al. (2019) introduce a Balanced COPA dataset by manually adjusting items from COPA to remove the superficial features... We therefore opt to use the recently published OpenWebText corpus, itself derived from a non-open dataset introduced in Radford et al. (2019). OpenWebText contains 40GB of text from over 8 million documents, spanning a plethora of resources and domains. (Footnote 5: http://Skylion007.github.io/OpenWebTextCorpus) |
| Dataset Splits | Yes | We train for a maximum of 50 epochs, stopping early when performance on the development set ceases to improve. We average the model performance on the development data over the results of the 20 seeds per learning rate, remove the bottom and top two outliers, and evaluate those models trained with the best-performing learning rate on the test set. ... Table 1 summarizes the results of model performance. The adversarially enhanced models outperform the models trained on the original data alone in terms of both average performance and standard deviation. (A hedged sketch of this seed-averaging protocol appears after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud computing specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using "RoBERTa (Liu et al. 2019) provided by Huggingface", "Penn Discourse TreeBank (Prasad et al. 2008, PDTB) parser (Lin, Ng, and Kan 2014)", "GPT-2 (Radford et al. 2019)", and "WordNet (Fellbaum 2012)" but does not specify version numbers for these software components or for underlying libraries such as PyTorch or TensorFlow. |
| Experiment Setup | Yes | Similarly to Kavumba et al. (2019), we use the learning rates of 1e-6, 2e-6 and 3e-6, 20 different seeds per learning rate, weight decay of 0.01, and a batch size of 32. We train for a maximum of 50 epochs, stopping early when performance on the development set ceases to improve. (A hedged configuration sketch of this setup follows the table.) |
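
Since the authors do not release code, the setup can only be approximated. The following is a minimal sketch of one fine-tuning run matching the hyperparameters quoted under "Experiment Setup", assuming the Huggingface Transformers `Trainer` API with `RobertaForMultipleChoice`; the checkpoint name (`roberta-large`), the early-stopping patience, and the `train_ds`/`dev_ds` dataset objects are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: one fine-tuning run with the hyperparameters reported in the
# paper (weight decay 0.01, batch size 32, up to 50 epochs, early stopping on
# dev performance). Learning rate and seed vary across the grid described above.
# NOTE: checkpoint name, patience value, and dataset objects are assumptions.
import numpy as np
from transformers import (
    EarlyStoppingCallback,
    RobertaForMultipleChoice,
    Trainer,
    TrainingArguments,
)

def compute_accuracy(eval_pred):
    """Dev-set accuracy used for model selection and early stopping."""
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

def build_trainer(train_ds, dev_ds, learning_rate, seed):
    """One COPA fine-tuning run for a given learning rate and random seed."""
    model = RobertaForMultipleChoice.from_pretrained("roberta-large")  # assumed checkpoint
    args = TrainingArguments(
        output_dir=f"copa_lr{learning_rate}_seed{seed}",
        learning_rate=learning_rate,          # one of 1e-6, 2e-6, 3e-6
        weight_decay=0.01,
        per_device_train_batch_size=32,
        num_train_epochs=50,                  # maximum; early stopping may end sooner
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        greater_is_better=True,
        seed=seed,
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=dev_ds,
        compute_metrics=compute_accuracy,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience assumed
    )
```

The paper runs 20 seeds for each of the three learning rates, so reproducing the grid would mean one `build_trainer(...)` call per (learning rate, seed) pair, i.e. 60 runs in total.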
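The model-selection protocol quoted under "Dataset Splits" (average dev performance over 20 seeds per learning rate, drop the two lowest and two highest scores, take the best learning rate to the test set) can be summarised in a few lines. This is a sketch of the trimmed-mean selection step only; the example scores are random placeholders, not results from the paper.

```python
# Hedged sketch of the seed-averaging / outlier-trimming protocol: per learning
# rate, dev accuracies from 20 seeds are sorted, the bottom two and top two are
# dropped, and the remaining 16 are averaged; the learning rate with the best
# trimmed mean is then evaluated on the test set.
import numpy as np

def trimmed_mean(scores, trim=2):
    """Mean after discarding the `trim` lowest and `trim` highest values."""
    ordered = np.sort(np.asarray(scores, dtype=float))
    return float(ordered[trim:len(ordered) - trim].mean())

def select_learning_rate(dev_scores):
    """Pick the learning rate whose 20-seed trimmed-mean dev accuracy is highest."""
    return max(dev_scores, key=lambda lr: trimmed_mean(dev_scores[lr]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder dev accuracies; real values would come from the 60 fine-tuning runs.
    dev_scores = {lr: rng.uniform(0.70, 0.90, size=20) for lr in (1e-6, 2e-6, 3e-6)}
    best_lr = select_learning_rate(dev_scores)
    print(f"best learning rate: {best_lr:g}, "
          f"trimmed dev accuracy: {trimmed_mean(dev_scores[best_lr]):.3f}")
```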