Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

De-biased Attention Supervision for Text Classification with Causality

Authors: Yiquan Wu, Yifei Liu, Ziyu Zhao, Weiming Lu, Yating Zhang, Changlong Sun, Fei Wu, Kun Kuang

AAAI 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive experiments on two professional text classification datasets (e.g., medicine and law), we demonstrate that our method achieves improved classification accuracy along with more coherent attention distributions.
Researcher Affiliation	Collaboration	Yiquan Wu (1), Yifei Liu (2), Ziyu Zhao (1), Weiming Lu (1), Yating Zhang (3), Changlong Sun (3), Fei Wu (1), Kun Kuang (1). (1) College of Computer Science and Technology, Zhejiang University, China; (2) College of Software Technology, Zhejiang University, China; (3) Alibaba Group, China
Pseudocode	Yes	Algorithm 1: The pseudocode of DAS.
Open Source Code	Yes	To motivate other scholars to investigate this problem, we make the code and data publicly available [2]. [2] https://github.com/6666ev/DAS
Open Datasets	Yes	Legal Verdict [4]. This dataset is released by Chinese AI and Law Challenge (CAIL2018) (Zhong et al. 2018), and it has been widely used in Legal AI research. Medical Triage [5]. This dataset collects medical conversations. The input is patients questions and the output is the corresponding department. The statistics of the two datasets are presented in Tab. 2. [4] https://github.com/thunlp/CAIL [5] https://github.com/liangsbin/Chinese-medical-dialogue-data
Dataset Splits	Yes	To ensure fair evaluations, we partition each dataset randomly into training, validation, and test sets, maintaining an 80%:10%:10% ratio.
Hardware Specification	Yes	We conducted our experiments using two V100 GPUs.
Software Dependencies	No	The paper mentions using 'Gensim' but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup	Yes	The size of the keyword vocabulary is set to 1000. The setting for λ is 0.15.
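
The Dataset Splits row quotes a random 80%:10%:10% partition into training, validation, and test sets. A minimal sketch of such a split is given here for illustration only; it is not taken from the authors' repository. The function name `split_80_10_10` and the `seed` parameter are hypothetical, and the paper excerpt does not state which random seed or shuffling procedure was used.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle a copy of `items` and cut it into train/validation/test parts.

    `seed` is an assumption made for repeatability; the quoted text
    does not report a seed.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # Integer arithmetic avoids floating-point rounding at the cut points.
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, roughly 10%
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_80_10_10(range(1000))
    print(len(train), len(val), len(test))
```

Fixing the seed makes the partition repeatable across runs, which is the kind of detail a reader would need in order to recreate the exact splits.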