Re-TACRED: Addressing Shortcomings of the TACRED Dataset

Authors: George Stoica, Emmanouil Antonios Platanios, Barnabas Poczos

AAAI 2021, pp. 13843-13850

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | After verification, we observed that 23.9% of TACRED labels are incorrect. Moreover, evaluating several models on our revised dataset yields an average f1-score improvement of 14.3% and helps uncover significant relationships between the different models (rather than simply offsetting or scaling their scores by a constant factor).
Researcher Affiliation | Collaboration | George Stoica (1), Emmanouil Antonios Platanios (2), Barnabas Poczos (1); (1) Carnegie Mellon University, (2) Microsoft Semantic Machines; gis@cs.cmu.edu, emplata@microsoft.com, bapoczos@cs.cmu.edu
Pseudocode | No | No pseudocode or algorithm blocks were found.
Open Source Code | No | We release our newly corrected TACRED labels publicly online (https://github.com/gstoica27/Re-TACRED). Due to licensing restrictions, we cannot release the complete dataset, but similar to Alt, Gabryszak, and Hennig (2020), we release a patch that contains all of our revisions. We term the corrected dataset Revised-TACRED (Re-TACRED). (A patch-application sketch appears below the table.)
Open Datasets | Yes | Finally, aside from our analysis we also release Re-TACRED, a new completely re-annotated version of the TACRED dataset that can be used to perform reliable evaluation of relation extraction models. ... We release our newly corrected TACRED labels publicly online (https://github.com/gstoica27/Re-TACRED).
Dataset Splits | Yes | TACRED consists of over 106,000 sentences collected from the 2009-2014 TAC knowledge base population (KBP) evaluations, with those between 2009-2012 used for training, 2013 for development, and 2014 for testing. (A split sketch appears below the table.)
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were mentioned.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were explicitly mentioned.
Experiment Setup | Yes | All results were reported using micro-averaged f1-scores from the model with the median validation f1-score over five independent runs, as in prior literature. (An evaluation-protocol sketch appears below the table.)
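
The Open Source Code row notes that Re-TACRED is distributed as a patch over the original TACRED labels. Below is a minimal sketch of how such a patch might be applied, assuming a hypothetical JSON format that maps example ids to corrected relation labels; the actual file names and layout are documented in the release at https://github.com/gstoica27/Re-TACRED.

```python
import json

def apply_retacred_patch(tacred_path: str, patch_path: str, out_path: str) -> None:
    """Overwrite TACRED relation labels with Re-TACRED revisions.

    Assumes (hypothetically) that the patch is a JSON object mapping
    example id -> corrected relation label; examples absent from the
    patch are treated as dropped during re-annotation and filtered out.
    """
    with open(tacred_path) as f:
        examples = json.load(f)  # list of TACRED example dicts
    with open(patch_path) as f:
        revised = json.load(f)   # {example_id: corrected_relation}

    patched = []
    for ex in examples:
        if ex["id"] in revised:
            ex["relation"] = revised[ex["id"]]
            patched.append(ex)

    with open(out_path, "w") as f:
        json.dump(patched, f)

# Hypothetical file names, for illustration only:
apply_retacred_patch("train.json", "re_tacred_train_patch.json", "re_tacred_train.json")
```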
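
The Dataset Splits row describes a year-based partition of the TAC KBP evaluations. A sketch of that split follows, assuming each example carries a hypothetical `year` field; TACRED itself encodes the source evaluation in its document ids rather than as an explicit field.

```python
def split_by_kbp_year(examples):
    """Partition examples into train/dev/test by TAC KBP evaluation year,
    following the 2009-2012 / 2013 / 2014 split described in the paper.
    The `year` key is a hypothetical per-example attribute."""
    train = [ex for ex in examples if 2009 <= ex["year"] <= 2012]
    dev = [ex for ex in examples if ex["year"] == 2013]
    test = [ex for ex in examples if ex["year"] == 2014]
    return train, dev, test
```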
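
The Experiment Setup row reports micro-averaged F1 from the run with the median validation score out of five. A minimal sketch of that protocol is below, assuming the common TACRED evaluation convention of excluding the negative `no_relation` class from the correct-prediction counts.

```python
def micro_f1(gold, pred, negative_label="no_relation"):
    """Micro-averaged F1 over relation predictions, counting only
    non-negative predictions and gold labels (common TACRED convention)."""
    n_pred = sum(p != negative_label for p in pred)
    n_gold = sum(g != negative_label for g in gold)
    n_correct = sum(g == p != negative_label for g, p in zip(gold, pred))
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def median_run_test_score(runs):
    """Given five (validation_f1, test_f1) pairs, return the test score
    of the run whose validation F1 is the median, as in the paper."""
    runs = sorted(runs, key=lambda r: r[0])
    return runs[len(runs) // 2][1]
```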