A Large-Scale Dataset for Argument Quality Ranking: Construction and Analysis

Authors: Shai Gretz, Roni Friedman, Edo Cohen-Karlik, Assaf Toledo, Dan Lahav, Ranit Aharonov, Noam Slonim

AAAI 2020, pp. 7805-7813 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we explore the challenging task of argument quality ranking. To this end, we created a corpus of 30,497 arguments carefully annotated for point-wise quality, released as part of this work. Moreover, we address the core issue of inducing a labeled score from crowd annotations by performing a comprehensive evaluation of different approaches to this problem. In addition, we analyze the quality dimensions that characterize this dataset. Finally, we present a neural method for argument quality ranking, which outperforms several baselines on our own dataset, as well as previous methods published for another dataset.
Researcher Affiliation | Industry | Shai Gretz, Roni Friedman, Edo Cohen-Karlik, Assaf Toledo, Dan Lahav, Ranit Aharonov, Noam Slonim (IBM Research); {avishaig, roni.friedman-melamed, noams}@il.ibm.com, {edo.cohen, assaf.toledo, dan.lahav, ranit.aharonov}@ibm.com
Pseudocode | No | The paper describes the methods in prose (e.g., BERT-Vanilla, BERT-Finetune) but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: 'We release this dataset as part of this work' (footnote 3: http://ibm.biz/debater-datasets). This link points to the dataset, not to source code for the methods.
Open Datasets | Yes | A major contribution of this work is introducing a novel dataset of arguments, carefully annotated for point-wise quality, IBM-ArgQ-Rank-30kArgs, referred to henceforth as IBM-Rank-30k. The dataset includes around 30k arguments, 5 times larger than the largest annotated point-wise data released to date (Toledo et al. 2019)... We release this dataset as part of this work (footnote 3: http://ibm.biz/debater-datasets).
Dataset Splits | Yes | For the purpose of evaluating our methods on the IBM-Rank-30k dataset, we split its 71 topics into 49 topics for training, 7 for tuning hyper-parameters and determining early stopping (dev set), and 15 for test. (A hedged sketch of such a topic-level split appears after this table.)
Hardware Specification | No | The paper does not specify any hardware details like CPU, GPU models, or memory used for experiments.
Software Dependencies | No | The paper mentions software such as the scikit-learn toolkit, BERT, ELMo, and GloVe, but does not provide specific version numbers for these components.
Experiment Setup | Yes | All models were trained for 5 epochs over the training data, taking the best checkpoint according to performance on the dev set, with a batch size of 32 and a learning rate of 2e-5. (A hedged training sketch using these settings follows the table.)
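Below is a minimal sketch of a topic-level split matching the counts reported above (71 topics: 49 train / 7 dev / 15 test). The file name and column names ("topic", and implicitly "argument" and a quality score such as "WA") are assumptions about the released CSV, not taken from the paper, and the authors' actual assignment of topics to splits may differ.

```python
# A hedged sketch of a topic-level train/dev/test split matching the reported
# counts (71 topics: 49 train / 7 dev / 15 test). File and column names are
# assumptions about the released CSV; the authors' actual assignment of
# topics to splits may differ.
import pandas as pd

df = pd.read_csv("arg_quality_rank_30k.csv")  # hypothetical file name

topics = sorted(df["topic"].unique())  # the paper reports 71 topics in total

# An arbitrary fixed partition by topic (no argument from a given topic
# appears in more than one split).
train_topics = set(topics[:49])
dev_topics = set(topics[49:56])
test_topics = set(topics[56:])

train_df = df[df["topic"].isin(train_topics)]
dev_df = df[df["topic"].isin(dev_topics)]
test_df = df[df["topic"].isin(test_topics)]

print(len(train_df), len(dev_df), len(test_df))
```

Splitting by topic rather than by individual argument keeps all arguments of a debate topic in a single partition, which matches the evaluation protocol quoted in the Dataset Splits row.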
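The following sketch illustrates point-wise fine-tuning of BERT for argument quality regression with the reported hyper-parameters (5 epochs, batch size 32, learning rate 2e-5, best checkpoint selected on the dev set). It assumes a recent HuggingFace transformers / PyTorch setup with a single-score regression head, and it pairs each argument with its topic as BERT's two input segments; these are assumptions, not the authors' original implementation.

```python
# A hedged sketch of point-wise BERT fine-tuning for argument quality
# regression. Hyper-parameters follow the paper (5 epochs, batch size 32,
# lr 2e-5, best checkpoint on dev); everything else is an assumption.
import torch
from torch.utils.data import DataLoader, TensorDataset
from scipy.stats import pearsonr
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# num_labels=1 yields a single-score head trained with MSE loss in recent
# versions of transformers; the paper's exact head and loss may differ.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)


def make_loader(arguments, topics, scores, shuffle):
    # Pairing each argument with its topic as BERT's two segments is an
    # assumption about how the inputs were constructed.
    enc = tokenizer(arguments, topics, padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    labels = torch.tensor(scores, dtype=torch.float)
    ds = TensorDataset(enc["input_ids"], enc["attention_mask"], labels)
    return DataLoader(ds, batch_size=32, shuffle=shuffle)


@torch.no_grad()
def evaluate(loader):
    # Pearson correlation between predicted and gold quality scores.
    model.eval()
    preds, golds = [], []
    for input_ids, attention_mask, labels in loader:
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        preds.extend(logits.squeeze(-1).tolist())
        golds.extend(labels.tolist())
    return pearsonr(preds, golds)[0]


def train(train_loader, dev_loader, epochs=5, lr=2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    best_corr, best_state = float("-inf"), None
    for _ in range(epochs):
        model.train()
        for input_ids, attention_mask, labels in train_loader:
            out = model(input_ids=input_ids, attention_mask=attention_mask,
                        labels=labels)
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        corr = evaluate(dev_loader)
        if corr > best_corr:  # keep the best checkpoint, as described above
            best_corr = corr
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return best_corr
```

Checkpoint selection here uses Pearson correlation on the dev topics, which is one plausible reading of "taking the best checkpoint according to performance on the dev set"; the paper's exact dev metric is not restated in the table above.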