Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ChatbotID: Identifying Chatbots with Granger Causality Test

Authors: Xiaoquan Yi, Haozhao Wang, Yining Qi, Wenchao Xu, Rui Zhang, Yuhua Li, Ruixuan Li

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results across multiple datasets and detection models demonstrate the effectiveness of our framework, with 15.92% improvements in accuracy for distinguishing between H-H and H-C dialogues.
Researcher Affiliation	Academia	1 School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China; 2 School of Division of Integrative Systems and Design, Hong Kong University of Science and Technology, Hong Kong, China.
Pseudocode	No	The paper describes the methodology in Section 4 using mathematical formulations and textual explanations, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code	Yes	The dataset and code are in https://anonymous.4open.science/r/Distinguishing-LLMs-by-Analyzing-Dialogue-Dynamics-with-Granger-Causality-56E4/. This direct provision of code and the newly constructed dataset is the strongest factor supporting reproducibility.
Open Datasets	Yes	To evaluate our proposed methodology across different conversational settings, we utilize four prominent English-language dialogue datasets: two focused on open-domain chit-chat (e.g., DailyDialog [57], PersonaChat [58]) and two on task-oriented interactions (e.g., MultiWOZ [59], Taskmaster-1 [60]).
Dataset Splits	No	The paper states: "The model is fine-tuned on the constructed H-H and H-M datasets... The model is evaluated on a separate test set, and the results were averaged over 5 runs to account for variability in training." However, it does not provide specific percentages or absolute counts for the training, validation, or test splits of these constructed datasets.
Hardware Specification	Yes	All our experiments were meticulously conducted on a high-performance computing platform running Ubuntu. The platform is powered by an Intel(R) Xeon(R) Platinum 8176 CPU @ 2.10GHz, delivering robust computational capabilities. The system is equipped with a substantial 503 GB of memory, ensuring efficient data processing and storage. Additionally, to further enhance computational power, we utilized four NVIDIA Corporation GA102GL RTX A6000 GPUs.
Software Dependencies	No	We implement our proposed methodology using the Hugging Face Transformers library. The GCT analysis is performed using the statsmodels library. However, specific version numbers for these libraries are not provided.
Experiment Setup	Yes	We implement our proposed methodology using the Hugging Face Transformers library. The model is fine-tuned on the constructed H-H and H-M datasets, with a batch size of 16 and a learning rate of 2e-5. The GCT analysis is performed using the statsmodels library, with a maximum lag of 5 for Granger causality tests. The model is trained for 20 epochs, with early stopping based on validation loss. The model is trained to utilize the AdamW optimizer, incorporating weight decay to enhance regularization.
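
The reported setup can be collected into a small configuration sketch, shown here only as an illustration of what a reproduction would need to pin down. The batch size, learning rate, epoch count, maximum lag, optimizer name, and number of runs come from the rows above. The `weight_decay` and `patience` values are not reported in the paper and are hypothetical placeholders, as are the class and function names. This is not the authors' code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ExperimentConfig:
    # Values reported in the Experiment Setup and Dataset Splits rows.
    batch_size: int = 16
    learning_rate: float = 2e-5
    max_epochs: int = 20
    granger_max_lag: int = 5
    num_runs: int = 5
    optimizer: str = "AdamW"
    # NOT reported in the source: hypothetical values for illustration only.
    weight_decay: float = 0.01
    patience: int = 3


def early_stopping_epoch(val_losses: Sequence[float], patience: int) -> int:
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs; otherwise it runs through
    every epoch in `val_losses`.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


def mean_over_runs(scores: Sequence[float]) -> float:
    """Average a metric over independent runs (the paper averages over 5)."""
    return sum(scores) / len(scores)
```

With the placeholder patience of 3, `early_stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.95, 0.7], 3)` stops at epoch 5, because the loss never improves on 0.8 during epochs 3 to 5. Since the paper leaves the patience, weight decay, and split sizes unstated, a reproduction would have to choose these values and report them.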