Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Hierarchical Demonstration Order Optimization for Many-shot In-Context Learning

Authors: Yinhan He, Wendy Zheng, Song Wang, Zaiyi Zheng, Yushun Dong, Yaochen Zhu, Jundong Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on multiple LLMs and real-world datasets demonstrate that our HIDO method consistently and efficiently outperforms other baselines. Our code project can be found at https://github.com/YinhanHe123/HIDO/. RQ1: How does HIDO perform compared to existing demonstration order optimization methods across different datasets and language models? RQ2: What is the impact of each key component in HIDO on its overall performance? RQ3: How sensitive is HIDO to its main hyperparameters such as the number of clusters and the maximum number of optimization iterations?
Researcher Affiliation	Academia	Yinhan He University of Virginia Charlottesville, VA EMAIL Wendy Zheng University of Virginia Charlottesville, VA EMAIL Song Wang University of Central Florida Orlando, FL EMAIL Zaiyi Zheng University of Virginia Charlottesville, VA EMAIL Yushun Dong Florida State University Tallahassee, FL EMAIL Yaochen Zhu University of Virginia Charlottesville, VA EMAIL Jundong Li University of Virginia Charlottesville, VA EMAIL
Pseudocode	No	The paper describes the HIDO framework in Section 4 "Methodology" and provides a visual overview in Figure 2, but it does not include explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code	Yes	Our code project can be found at https://github.com/YinhanHe123/HIDO/.
Open Datasets	Yes	We adopt nine text classification datasets: AG's News Corpus (AGNews) [46], Commitment Bank (CB) [7], Customer Review (CR) [11], DBPedia Ontology Classification (DBPedia) [46], Multi-Perspective Question Answering (MPQA) [39], Movie Review (MR) [30], Recognizing Textual Entailment (RTE) [6], Stanford Sentiment Treebank-5 (SST-5) [34], and Text REtrieval Conference Question Classification (TREC) [37]. (Appendix C further clarifies licenses and public availability for each dataset.)
Dataset Splits	No	We sub-sample 256 instances from each dataset due to budget constraints.
Hardware Specification	Yes	We conduct experiments using a system equipped with four NVIDIA A100 80GB PCIe GPUs. The system ran NVIDIA driver version 550.54.14 and CUDA 12.4.
Software Dependencies	Yes	We implement the project with Python, mainly relying on the PyTorch [31] and Transformers [40] packages for the implementation. The system ran NVIDIA driver version 550.54.14 and CUDA 12.4.
Experiment Setup	Yes	Here, we address RQ3. Although our model has numerous hyperparameters, we focus our analysis on two we consider most significant: the number of clusters k and the maximum number of optimization iterations l. Fig. 3 (b) illustrates our model's performance with varying k and l on the TREC and MPQA datasets using the Sciphi model. We limit the number of clusters to be small (typically no more than four) as a larger number would cause a combinatorial explosion during HIDO's inter-cluster order optimization stage, where all possible orders are evaluated.
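
The combinatorial explosion mentioned in the Experiment Setup row can be illustrated with a minimal sketch. It assumes that "all possible orders" means every permutation of the k clusters, so the number of candidates is k!. The function names below are hypothetical and are not taken from the HIDO code base.

```python
from itertools import permutations
from math import factorial


def count_cluster_orders(k: int) -> int:
    """Number of orderings of k clusters, assuming every permutation is a candidate."""
    return factorial(k)


def enumerate_cluster_orders(k: int) -> list[tuple[int, ...]]:
    """List every ordering of cluster indices 0..k-1 (hypothetical helper)."""
    return list(permutations(range(k)))


if __name__ == "__main__":
    # Show how quickly the candidate count grows with the number of clusters.
    for k in range(2, 9):
        print(f"k={k}: {count_cluster_orders(k)} candidate orders")
```

Under this assumption, four clusters give 24 candidate orders while eight give 40320, which is consistent with the stated reason for keeping k small.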