A Large-Scale Dataset for Argument Quality Ranking: Construction and Analysis
Authors: Shai Gretz, Roni Friedman, Edo Cohen-Karlik, Assaf Toledo, Dan Lahav, Ranit Aharonov, Noam Slonim (pp. 7805-7813)
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we explore the challenging task of argument quality ranking. To this end, we created a corpus of 30,497 arguments carefully annotated for point-wise quality, released as part of this work. Moreover, we address the core issue of inducing a labeled score from crowd annotations by performing a comprehensive evaluation of different approaches to this problem. In addition, we analyze the quality dimensions that characterize this dataset. Finally, we present a neural method for argument quality ranking, which outperforms several baselines on our own dataset, as well as previous methods published for another dataset. |
| Researcher Affiliation | Industry | Shai Gretz, Roni Friedman, Edo Cohen-Karlik, Assaf Toledo, Dan Lahav, Ranit Aharonov, Noam Slonim IBM Research {avishaig, roni.friedman-melamed, noams}@il.ibm.com {edo.cohen, assaf.toledo, dan.lahav, ranit.aharonov}@ibm.com |
| Pseudocode | No | The paper describes the methods in prose (e.g., BERT-Vanilla, BERT-Finetune) but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: 'We release this dataset as part of this work' (footnote 3: http://ibm.biz/debater-datasets). This link is for the dataset, not the source code for the methodology. |
| Open Datasets | Yes | A major contribution of this work is introducing a novel dataset of arguments, carefully annotated for point-wise quality, IBM-ArgQ-Rank-30kArgs, referred to henceforth as IBM-Rank-30k. The dataset includes around 30k arguments, 5 times larger than the largest annotated point-wise data released to date (Toledo et al. 2019)... We release this dataset as part of this work (footnote 3: http://ibm.biz/debater-datasets). |
| Dataset Splits | Yes | For the purpose of evaluating our methods on the IBM-Rank-30k dataset, we split its 71 topics to 49 topics for training, 7 for tuning hyper-parameters and determining early stopping (dev set) and 15 for test. |
| Hardware Specification | No | The paper does not specify any hardware details like CPU, GPU models, or memory used for experiments. |
| Software Dependencies | No | The paper mentions software like 'scikit-learn toolkit', 'BERT', 'ELMo', 'GloVe' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | All models were trained for 5 epochs over the training data, taking the best checkpoint according to the performance on the dev set, with a batch size of 32 and a learning rate of 2e-5. |
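
The Research Type row notes that the paper induces a point-wise quality score from crowd annotations and evaluates several approaches to that aggregation. As an illustration only, the sketch below shows the simplest such approach, an (optionally reliability-weighted) average of binary judgments per argument; the function and field names are assumptions for illustration, not the authors' code, and the paper's actual scoring functions may differ.

```python
from collections import defaultdict

def aggregate_quality_scores(annotations, annotator_weights=None):
    """Aggregate binary crowd judgments into a per-argument quality score.

    annotations: iterable of (argument_id, annotator_id, label) with label in {0, 1}.
    annotator_weights: optional dict annotator_id -> reliability weight;
        if omitted, every annotator counts equally (plain averaging).
    Returns a dict argument_id -> score in [0, 1].
    """
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for arg_id, annotator_id, label in annotations:
        w = 1.0 if annotator_weights is None else annotator_weights.get(annotator_id, 0.0)
        weighted_sum[arg_id] += w * label
        weight_total[arg_id] += w
    return {arg_id: weighted_sum[arg_id] / weight_total[arg_id]
            for arg_id in weighted_sum if weight_total[arg_id] > 0}

# Toy example: two arguments, three annotators, equal weights.
example = [("a1", "u1", 1), ("a1", "u2", 1), ("a1", "u3", 0),
           ("a2", "u1", 0), ("a2", "u2", 1), ("a2", "u3", 0)]
print(aggregate_quality_scores(example))  # {'a1': 0.666..., 'a2': 0.333...}
```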
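
The Dataset Splits row reports a topic-level partition of the 71 topics into 49 train / 7 dev / 15 test, so that no topic is shared across splits. Below is a minimal sketch of such a split, assuming a pandas DataFrame with a `topic` column (the column name and random seed are assumptions; only the counts come from the paper).

```python
import random

import pandas as pd

def split_by_topic(df, n_train=49, n_dev=7, n_test=15, seed=42):
    """Partition a DataFrame into train/dev/test so that no topic appears
    in more than one split (assumes a 'topic' column; counts follow 49/7/15)."""
    topics = sorted(df["topic"].unique())
    assert len(topics) >= n_train + n_dev + n_test
    rng = random.Random(seed)
    rng.shuffle(topics)
    train_topics = set(topics[:n_train])
    dev_topics = set(topics[n_train:n_train + n_dev])
    test_topics = set(topics[n_train + n_dev:n_train + n_dev + n_test])
    return (df[df["topic"].isin(train_topics)],
            df[df["topic"].isin(dev_topics)],
            df[df["topic"].isin(test_topics)])

# Usage: train_df, dev_df, test_df = split_by_topic(arguments_df)
```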
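
The Experiment Setup row fixes the main hyper-parameters: 5 epochs, batch size 32, learning rate 2e-5, and checkpoint selection by dev-set performance. Below is a hedged sketch of an equivalent fine-tuning configuration using the Hugging Face `transformers` Trainer, a library the paper does not mention; everything except the quoted hyper-parameter values is an assumption, not the authors' setup.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# num_labels=1 turns the classification head into a regression head,
# matching a point-wise quality-score target.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

args = TrainingArguments(
    output_dir="bert-argq",            # hypothetical output path
    num_train_epochs=5,                # from the quoted setup
    per_device_train_batch_size=32,    # from the quoted setup
    learning_rate=2e-5,                # from the quoted setup
    evaluation_strategy="epoch",       # evaluate on the dev set each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,       # keep the best checkpoint by dev metric
)

# train_dataset / dev_dataset are assumed to be tokenized datasets with a
# float 'labels' column holding the aggregated quality score:
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset, eval_dataset=dev_dataset)
# trainer.train()
```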