Fine-Grained Argument Unit Recognition and Classification
Authors: Dietrich Trautmann, Johannes Daxenberger, Christian Stab, Hinrich Schütze, Iryna Gurevych
AAAI 2020, pp. 9048-9056 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a dataset of arguments from heterogeneous sources annotated as spans of tokens within a sentence, as well as with a corresponding stance. We show that and how such difficult argument annotations can be effectively collected through crowdsourcing with high inter-annotator agreement. The new benchmark, AURC-8, contains up to 15% more arguments per topic as compared to annotations on the sentence level. We identify a number of methods targeted at AURC sequence labeling, achieving close to human performance on known domains. Further analysis also reveals that, contrary to previous approaches, our methods are more robust against sentence segmentation errors. We publicly release our code and the AURC-8 dataset. |
| Researcher Affiliation | Academia | Center for Information and Language Processing (CIS), LMU Munich, Germany Ubiquitous Knowledge Processing Lab (UKP-TUDA), TU Darmstadt, Germany |
| Pseudocode | No | The paper does not include pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | We publicly release our code and the AURC-8 dataset: https://github.com/trtm/AURC |
| Open Datasets | Yes | We publicly release our code and the AURC-8 dataset: https://github.com/trtm/AURC |
| Dataset Splits | Yes | As a result, there are 4000 samples in train, 800 in dev and 2000 in test for the cross-domain split; and 4200 samples in train, 600 in dev and 1200 in test for the in-domain split. (See the split-size sketch below the table.) |
| Hardware Specification | No | BERT-Base, which requires only one GPU for training, is a good option if computational resources are limited. (No specific GPU model, processor, or memory details are provided.) |
| Software Dependencies | No | The paper mentions software such as FLAIR, BERT, spaCy, Elasticsearch, jusText, and the ArgumenText Classify API, but it does not specify version numbers for these dependencies, only their general usage. |
| Experiment Setup | Yes | This section lists the hyperparameters used for the experimental systems described in the main part of the paper. For FLAIR, the token-level model used a learning rate of 1e-1 with gradual decay and a hidden size of 256; the sentence-level model used the same learning-rate schedule with a hidden size of 512. For BERT-Large and BERT-Large+CRF, the large cased pretrained model with whole-word masking was used, with a learning rate of 1e-5 in the token-level setup for both in-domain and cross-domain. The learning rate was kept at 4e-5 for the sentence-level BERT-Large model and at 1e-5 for BERT-Large+CRF, using the AdamW optimizer. The maximum length of the tokenized BERT input was set to 64 tokens, and the dropout rate was always 0.1. All experiments were run three times with different seeds, with a training batch size of 32 and a maximum of 100 epochs, with early stopping if the performance did not improve (or the loss did not decrease) significantly within ten epochs. (A hedged configuration sketch based on these values follows the table.) |
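
For readers who want to sanity-check the reported split sizes, the snippet below collects them in one place. This is a minimal sketch assuming a plain Python layout; only the counts themselves come from the paper, and the names `AURC8_SPLIT_SIZES` and `summarize` are illustrative, not part of the released code.

```python
# Split sizes as reported in the paper's Dataset Splits statement.
# Only the counts come from the paper; the layout and helper are illustrative.
AURC8_SPLIT_SIZES = {
    "cross-domain": {"train": 4000, "dev": 800, "test": 2000},
    "in-domain": {"train": 4200, "dev": 600, "test": 1200},
}

def summarize(setting: str) -> str:
    """Return a one-line summary of the reported sizes for one split setting."""
    sizes = AURC8_SPLIT_SIZES[setting]
    total = sum(sizes.values())
    return (f"{setting}: train={sizes['train']}, dev={sizes['dev']}, "
            f"test={sizes['test']} (total {total})")

if __name__ == "__main__":
    for setting in AURC8_SPLIT_SIZES:
        print(summarize(setting))
```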
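
The Experiment Setup row above can likewise be condensed into a small configuration object for anyone attempting a reimplementation. This is a minimal sketch assuming a PyTorch / Hugging Face style setup; the dataclass, its field names, and the checkpoint identifier `bert-large-cased-whole-word-masking` are assumptions, while the numeric values mirror the hyperparameters reported for the token-level BERT-Large model.

```python
from dataclasses import dataclass

# Hyperparameters taken from the Experiment Setup row above. The dataclass,
# field names, and checkpoint identifier are assumptions, not the authors' code.
@dataclass(frozen=True)
class BertLargeTokenLevelConfig:
    pretrained_model: str = "bert-large-cased-whole-word-masking"  # assumed checkpoint name
    learning_rate: float = 1e-5        # token-level, in-domain and cross-domain
    max_seq_length: int = 64           # max. tokenized BERT input length
    dropout: float = 0.1
    batch_size: int = 32
    max_epochs: int = 100
    early_stopping_patience: int = 10  # stop if no significant improvement for ten epochs
    num_runs: int = 3                  # three runs with different (unreported) seeds

config = BertLargeTokenLevelConfig()

# Optimizer wiring as a sketch, assuming a PyTorch `model` object exists;
# the paper names AdamW but does not publish this training loop.
# import torch
# optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
```

The FLAIR settings (learning rate 1e-1 with gradual decay, hidden sizes 256 and 512) and the sentence-level BERT-Large learning rate of 4e-5 would be captured by analogous configuration objects.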