Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning

Authors: Yongxin He, Shan Zhang, Yixuan Cao, Lei Ma, Ping Luo

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	4 Experiments 4.1 Experimental Setup Dataset. We conduct in-distribution supervised experiments on the MAGE [16], M4 [46, 47], Turing Bench [48], OUTFOX [37], and RAID [49] datasets. [...] Comparison Methods and Metrics. We evaluate zero-shot methods [...] Evaluation metrics include F1, Avg Rec (mean recall of human and machine text), and AUC-ROC. We additionally report TPR@5% FPR to assess detection performance under low false positive constraints. 4.2 Results We explore our method across five key dimensions: (I) HAT Analysis: effectiveness of TSCL and interpretability of the HAT structure; (II) Supervised Detection: model performance on in-distribution binary classification tasks; (III) Out-of-Distribution Generalization: model performance on binary classification under OOD settings; (IV) Hybrid Text Detection: the ability to identify varying forms of AI involvement in collaborative text generation; (V) Practical Robustness and Deployability: model performance under real-world constraints, including adversarial perturbations, database compression, and limited category coverage.
Researcher Affiliation	Academia	(1) State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; (2) The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; (3) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; (4) University of Chinese Academy of Sciences, CAS, Beijing, China
Pseudocode	Yes	Algorithm 1: Hierarchical Affinity Tree Construction Algorithm. 1: Input: 2: similarity matrix S ∈ R^{N×N}, end score s 3: Output: 4: Hierarchical Affinity Tree T 5: Definitions: 6: Silhouette Score: Clustering quality metric. Partition: Divide the subtree of node into subgroups whose root merge_score is bounded below by τ. Agglomerative Clustering: Hierarchical clustering method. node.merge_score: Subtree similarity at merge. 7: Step 1, Hierarchical Clustering: 8: Build dendrogram T_hc ← Agglomerative Clustering(S) 9: Step 2, Tree Reconstruction: 10: function RECONSTRUCTION(node) 11: T ← node 12: if node.is_leaf or node.merge_score ≥ s then 13: return T 14: end if 15: τ_set ← set of merge_scores from all descendant nodes of node. 16: Initialize tracker τ*, score* 17: for each τ ∈ τ_set do 18: Compute the partition C ← Partition(node, τ) 19: score ← Silhouette Score(C, S) 20: if score > score* then 21: τ* ← τ, score* ← score 22: end if 23: end for 24: children ← Partition(node, τ*) 25: for each child ∈ children do 26: subtree ← Reconstruction(child) 27: T.add_edge(node, subtree) 28: end for 29: return T 30: end function 31: T ← Reconstruction(T_hc.root, 0)
Open Source Code	Yes	Our code and dataset are available at https://github.com/heyongxin233/DETree.
Open Datasets	Yes	Our code and dataset are available at https://github.com/heyongxin233/DETree. [...] Justification: We introduce a new dataset in this paper. The dataset is constructed based on publicly available sources, and we will release it under the MIT license with accompanying documentation that includes construction details, data sources, licensing information, and usage instructions. No personal data is involved, and an anonymized download link will be provided at submission.
Dataset Splits	Yes	4.1 Experimental Setup Dataset. We conduct in-distribution supervised experiments on the MAGE [16], M4 [46, 47], Turing Bench [48], OUTFOX [37], and RAID [49] datasets. [...] Detailed dataset specifications are provided in Appendix H and I. [...] Table 7: Composition of the Real Bench dataset, where Basic text and Hybrid text represent the number of texts in basic and hybrid categories respectively, Basic categories and Hybrid categories denote the number of basic and hybrid categories. In the test set, the number of categories is presented as 'x/y', with x being the corresponding category number and y representing the number of categories that only exist in the test set but not in the validation and training sets. Columns: Split | Basic text | Hybrid text | Total text | Basic categories | Hybrid categories | Total categories. Dataset MAGE: train | 1,433,025 | 4,128,555 | 5,561,580 | 28 | 419 | 447; valid | 255,164 | 809,730 | 1,064,894 | 28 | 419 | 447; test | 257,232 | 412,137 | 669,369 | 29/1 | 223/140 | 252.
Hardware Specification	Yes	Training runs for 10 epochs on 8 RTX 4090 GPUs with a batch size of 64 and a maximum input length of 512 tokens.
Software Dependencies	No	We fine-tune RoBERTa-large [57] using LoRA [58], with AdamW [59], cosine annealing, a 3e-5 initial learning rate, and 2000 step linear warm-up. [...] For inference, we use Faiss-GPU [60] for efficient K-means and K-Nearest Neighbors.
Experiment Setup	Yes	We fine-tune RoBERTa-large [57] using LoRA [58], with AdamW [59], cosine annealing, a 3e-5 initial learning rate, and 2000 step linear warm-up. Training runs for 10 epochs on 8 RTX 4090 GPUs with a batch size of 64 and a maximum input length of 512 tokens. For inference, we use Faiss-GPU [60] for efficient K-means and K-Nearest Neighbors. The representation layer is selected from layers 17–19, and the number of neighbors k is set to 5 or 50 based on validation performance. To compute class probabilities, we follow DINOv2 [61] and replace majority voting with similarity-weighted scoring over the retrieved neighbors. The contrastive-learning temperature τ is set to 0.07.
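
The similarity-weighted scoring mentioned in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses Faiss-GPU for neighbor search, while the code below uses a brute-force pure-Python search. The function names (`cosine`, `weighted_knn_probs`), the toy vectors and labels, the exponential weighting `exp(similarity / temperature)`, and the reuse of 0.07 as the voting temperature are all assumptions made for illustration; the source only states that 0.07 is the contrastive-learning temperature.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def weighted_knn_probs(query, bank, labels, k=5, temperature=0.07):
    """Class probabilities from the k most similar bank vectors.

    Each neighbor votes with weight exp(similarity / temperature)
    instead of a flat vote of 1 (assumed weighting, see lead-in).
    """
    sims = [(cosine(query, vec), lab) for vec, lab in zip(bank, labels)]
    top = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    # Shift by the largest similarity so the exponentials stay bounded.
    max_sim = top[0][0]
    scores = defaultdict(float)
    for sim, lab in top:
        scores[lab] += math.exp((sim - max_sim) / temperature)
    total = sum(scores.values())
    return {lab: score / total for lab, score in scores.items()}

# Toy example: one very close "human" neighbor, two distant "ai" neighbors.
bank = [[1.0, 0.0], [0.6, 0.8], [0.5, 0.866]]
labels = ["human", "ai", "ai"]
probs = weighted_knn_probs([1.0, 0.0], bank, labels, k=3)
print(probs)
```

In this toy case a plain majority vote over the three neighbors would pick "ai" (two votes to one), whereas the similarity-weighted score favors "human" because the single close neighbor dominates the two distant ones.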