Consistent Inference for Dialogue Relation Extraction

Authors: Xinwei Long, Shuzi Niu, Yucheng Li

IJCAI 2021

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results on two benchmark datasets show that the F1 performance improvement of the proposed method is at least 3.3% compared with SOTA. We conduct comprehensive experiments on two benchmark datasets, DialogRE [Yu et al., 2020] and MPDD [Chen et al., 2020b], and CoIn shows 3.3% and 6.2% improvements in terms of F1 (DialogRE) and accuracy (MPDD) over state-of-the-art models. Ablation studies prove the effectiveness of each module.
Researcher Affiliation Academia (1) Institute of Software, Chinese Academy of Sciences; (2) University of Chinese Academy of Sciences. longxinwei19@mails.ucas.ac.cn, {shuzi, yucheng}@iscas.ac.cn
Pseudocode No The paper includes an architecture diagram (Figure 2) but does not contain structured pseudocode or algorithm blocks.
Open Source Code Yes Source code and pre-processed data are released at https://github.com/xinwei96/CoIn_dialogRE
Open Datasets Yes Datasets. (1) DialogRE [Yu et al., 2020]. We follow the standard settings offered by the original paper, and use F1 score as the metric. (2) MPDD [Chen et al., 2020b]. More details of DialogRE and processed MPDD can be found in Table 1. Source code and pre-processed data are released at https://github.com/xinwei96/CoIn_dialogRE
Dataset Splits Yes Table 1: Dataset Statistics. Dialog Num. 1073 / 358 / 357, Relation Num. 4992 / 1597 / 1529. (These numbers represent train / dev / test splits, where 'dev' typically serves as the validation set).
Hardware Specification Yes Experiments are conducted on a server with a GeForce GTX 1080Ti GPU and 64 GB memory.
Software Dependencies Yes Our model was implemented in PyTorch with CUDA 11.0.
Experiment Setup Yes We adopt the BERT-base architecture with a fine-tuning learning rate of 2e-5. We use a self-attention layer with dropout 0.2 and learning rate 5e-4. The number of windows K is set to 2, chosen from {1, 2, 3, 4}. We use AdamW [Loshchilov and Hutter, 2019] as the optimizer with a Cosine Annealing scheduler [Loshchilov and Hutter, 2017]. The threshold τ of the multi-label classifier and the trade-off parameters λ1 and λ2 are set to 0.51.
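
For readers reproducing the setup above, the following is a minimal PyTorch sketch of the reported optimizer and scheduler configuration (two learning rates, dropout 0.2, AdamW with cosine annealing, threshold 0.51). It is not the authors' released code: the checkpoint name "bert-base-uncased", the single attention head, and the scheduler horizon T_max=20 are assumptions not stated in the paper.

```python
# Sketch of the reported training configuration, assuming a Hugging Face
# BERT-base encoder and one self-attention layer on top (illustrative modules).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from transformers import BertModel

encoder = BertModel.from_pretrained("bert-base-uncased")  # BERT-base backbone (checkpoint name assumed)
attention = torch.nn.MultiheadAttention(embed_dim=768, num_heads=1, dropout=0.2)  # self-attention layer, dropout 0.2

# Separate learning rates: 2e-5 for BERT fine-tuning, 5e-4 for the new layer.
optimizer = AdamW([
    {"params": encoder.parameters(), "lr": 2e-5},
    {"params": attention.parameters(), "lr": 5e-4},
])

# Cosine annealing schedule; T_max (number of epochs) is an assumption.
scheduler = CosineAnnealingLR(optimizer, T_max=20)

def predict(logits, tau=0.51):
    """Multi-label decision: output a relation when its sigmoid score exceeds the threshold τ = 0.51."""
    return (torch.sigmoid(logits) > tau).long()
```

In a training loop under this sketch, optimizer.step() would run per batch and scheduler.step() once per epoch; the paper does not specify the epoch budget, so T_max would need to match whatever schedule is actually used.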