
Re-TACRED: Addressing Shortcomings of the TACRED Dataset

Authors: George Stoica, Emmanouil Antonios Platanios, Barnabas Poczos
Pages: 13843-13850

AAAI 2021

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	After verification, we observed that 23.9% of TACRED labels are incorrect. Moreover, evaluating several models on our revised dataset yields an average f1-score improvement of 14.3% and helps uncover significant relationships between the different models (rather than simply offsetting or scaling their scores by a constant factor).
Researcher Affiliation	Collaboration	George Stoica (1), Emmanouil Antonios Platanios (2), Barnabas Poczos (1); (1) Carnegie Mellon University, (2) Microsoft Semantic Machines; gis@cs.cmu.edu, emplata@microsoft.com, bapoczos@cs.cmu.edu
Pseudocode	No	No pseudocode or algorithm blocks were found.
Open Source Code	No	We release our newly corrected TACRED labels publicly online (https://github.com/gstoica27/Re-TACRED). Due to licensing restrictions, we cannot release complete dataset, but similar to Alt, Gabryszak, and Hennig (2020), we release a patch that contains all of our revisions. We term the corrected dataset Revised-TACRED (Re-TACRED).
Open Datasets	Yes	Finally, aside from our analysis we also release Re-TACRED, a new completely re-annotated version of the TACRED dataset that can be used to perform reliable evaluation of relation extraction models. ... We release our newly corrected TACRED labels publicly online (https://github.com/gstoica27/Re-TACRED).
Dataset Splits	Yes	TACRED consists of over 106,000 sentences collected from the 2009-2014 TAC knowledge base population (KBP) evaluations, with those between 2009-2012 used for training, 2013 for development, and 2014 for testing.
Hardware Specification	No	No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were mentioned.
Software Dependencies	No	No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were explicitly mentioned.
Experiment Setup	Yes	All results were reported using micro-averaged f1-scores from the model with the median validation f1-score over five independent runs, as in prior literature.
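
The reporting protocol in the Experiment Setup entry (micro-averaged F1, taken from the run with the median validation F1 over five runs) can be sketched roughly as follows. This is an illustrative sketch, not the authors' evaluation script: the function names `micro_f1` and `median_run` are hypothetical, and excluding the negative class `no_relation` from the counts is an assumption based on common practice for TACRED-style scoring, not something stated in the entry above.

```python
from typing import List, Sequence, Tuple

# Assumption: the negative class is named "no_relation" and is excluded
# from precision/recall counts, as is common for TACRED-style scoring.
NEGATIVE_LABEL = "no_relation"


def micro_f1(gold: Sequence[str], pred: Sequence[str],
             negative_label: str = NEGATIVE_LABEL) -> float:
    """Micro-averaged F1 over all positive (non-negative-class) relations."""
    # A prediction counts as correct only if it is a positive label
    # that matches the gold label exactly.
    correct = sum(1 for g, p in zip(gold, pred)
                  if p != negative_label and p == g)
    predicted = sum(1 for p in pred if p != negative_label)
    actual = sum(1 for g in gold if g != negative_label)
    if correct == 0:
        return 0.0
    precision = correct / predicted
    recall = correct / actual
    return 2 * precision * recall / (precision + recall)


def median_run(runs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Pick the (validation_f1, test_f1) pair whose validation F1 is the median.

    With an odd number of runs (e.g. five) the median is an actual run,
    so its test score can be reported directly.
    """
    ordered = sorted(runs, key=lambda run: run[0])
    return ordered[len(ordered) // 2]
```

Selecting the run by its validation score, rather than by its test score, keeps the test set out of model selection; the reported number is then the test F1 of that single run.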