Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Learning to Rank for In-Context Example Retrieval

Authors: Yuwen Ji, Luodan Zhang, Ambyer Han, Haoran Que, Lei Shi, Wang Chao, Yue Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimentally, our method constantly ranks in top-1 across 9 NLP tasks, with an improvement of up to 18% over the best-performing classification-based baseline, achieving the SOTA results. Ablation study further confirms the usefulness of our ranking formulation and its complementary strength to existing paradigms. We summarize our key contributions as follows: [...] 4 Experiments We take the SOTA method Se2 [13] as our base, which is a classification-based method, and implement SeDPO on top of it. We first compare the ICL performance with SOTA retrievers (main results); validate key components in SeDPO (ablation); then provide extra experiments for deeper analysis.
Researcher Affiliation	Collaboration	Yuwen Ji,1,2 Luodan Zhang,2 Ambyer Han,3 Haoran Que,4 Lei Shi,5 Wang Chao,3 Yue Zhang2 1Zhejiang University 2Westlake University 3Amap, Alibaba Group 4Peking University 5Beihang University EMAIL
Pseudocode	No	The paper describes algorithms and formulations (e.g., in Section 3.3 "Algorithm for learning orders"), but does not present a clearly labeled pseudocode block or algorithm steps in a structured format.
Open Source Code	Yes	The code can be found in: https://github.com/2022neo/SeDPO_NIPS25
Open Datasets	Yes	We use a total of 9 tasks across 4 distinct categories, including Paraphrase: MRPC [45], PAWS [46], QQP [47]; Coreference: WSC [48]; Reading: MultiRC [49], BoolQ [50], AGNews [51]; NLI: MNLI-m/mm [52]. [...] All datasets are publicly available under open licenses (e.g., CC-BY, CC-BY-SA, or research-only terms).
Dataset Splits	Yes	MRPC: A paraphrase task with 3,668 training examples and 408 test examples, evaluated using Accuracy and F1. PAWS: A paraphrase task with 49,401 training examples and 8,000 test examples, evaluated using Accuracy. QQP: A paraphrase task with 363,846 training examples and 40,430 test examples, evaluated using Accuracy and F1. WSC: A coreference task with 554 training examples and 104 test examples, evaluated using Accuracy. MultiRC: A reading comprehension task with 27,243 training examples and 4,848 test examples, evaluated using F1. BoolQ: A reading comprehension task with 9,427 training examples and 3,270 test examples, evaluated using Accuracy. AGNews: A reading comprehension task with 120,000 training examples and 7,600 test examples, evaluated using Accuracy. MNLI-m/mm: A natural language inference task with 392,702 training examples and 9,815/9,832 test examples for m/mm, evaluated using Accuracy.
Hardware Specification	Yes	We trained the retriever with 8 threads in a data-distributed manner on 8*A100-80GB.
Software Dependencies	No	The paper mentions models like "BERT-base-uncased" [43], "GPT-Neo-2.7B" [53], and "RoBERTa-base" [54] as encoders or LLMs, but does not list specific versions of programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA) with their version numbers.
Experiment Setup	Yes	Table 5: Hyperparameter settings. shot-number: 3; Max sequence length: 512 for retriever, 2048 for LLMs; Optimizer: Adam; Number of epochs: 6 per GPU; Max learning rate: 1e-5; Preference ̸ > 0.001; # of samples per batch for SeDPO: 832(1+1) < 2; # of samples per batch for Se2: 132(1+2*T); Adam epsilon: 1e-8; Warmup steps: 1000; Adam beta weights: 0.9, 0.999; Learning rate decay: linear; Weight decay: 0.0; Learning rate scheduler: warmup linear
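
The Experiment Setup row lists a maximum learning rate of 1e-5, 1000 warmup steps, and a linear warmup/linear decay schedule. The following is a minimal sketch of what such a schedule commonly looks like, not the authors' implementation. The function name `linear_warmup_decay_lr` and the `total_steps` argument are hypothetical: the total number of optimizer steps is not reported in the row above, so the value used in the usage lines is purely illustrative.

```python
# Values taken from the Experiment Setup row (Table 5 of the paper).
MAX_LR = 1e-5
WARMUP_STEPS = 1000


def linear_warmup_decay_lr(step, total_steps, max_lr=MAX_LR, warmup_steps=WARMUP_STEPS):
    """Return the learning rate at a 0-indexed optimizer step.

    Assumed shape (not confirmed by the source): the rate rises linearly
    from 0 to max_lr over warmup_steps, then falls linearly to 0 at
    total_steps. total_steps is a hypothetical input.
    """
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        # Warmup phase: fraction of warmup completed so far.
        return max_lr * step / warmup_steps
    # Decay phase: fraction of post-warmup steps still remaining.
    remaining = max(0, total_steps - step)
    return max_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    # Illustrative only: 10,000 total steps is an assumption.
    for s in (0, 500, 1000, 5500, 10000):
        print(s, linear_warmup_decay_lr(s, total_steps=10_000))
```

Under these assumptions the rate peaks exactly at the end of warmup and reaches zero at the final step; anyone reproducing the paper should take the real step count and scheduler from the released code rather than from this sketch.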