Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Scalable In-context Ranking with Generative Models

Authors: Nilesh Gupta, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Inderjit Dhillon, Felix Yu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on BEIR, MSMarco and NQ with Mistral-7B demonstrate that BlockRank Mistral matches or outperforms existing SOTA listwise rankers and a controlled fine-tuned baseline while being significantly more efficient at inference (4.7× for 100 MSMarco documents in context) and scaling gracefully to long-context shortlists of around 500 documents in-context (~100K context length) within a second, presenting a scalable and effective solution for ICR.
Researcher Affiliation	Collaboration	Nilesh Gupta (1), Chong You (3), Srinadh Bhojanapalli (3), Sanjiv Kumar (3), Inderjit Dhillon (1, 2), Felix Yu (3); (1) University of Texas at Austin, (2) Google, (3) Google DeepMind
Pseudocode	No	The paper describes methodologies and processes (e.g., Blockwise Structured Attention, Auxiliary Attention Loss) with mathematical formulations and diagrams, but it does not present a formal pseudocode block or algorithm steps formatted like code.
Open Source Code	Yes	https://github.com/nilesh2797/BlockRank
Open Datasets	Yes	Datasets & Formatting. For assessing zero-shot generalization, we use 11 diverse datasets from the BEIR benchmark [Thakur et al., 2021] replicating Table 1 in Reddy et al. [2024]. For in-domain analysis, we use two standard passage retrieval benchmarks: MSMarco Passage Ranking [Bajaj et al., 2018] and Natural Questions (NQ) [Kwiatkowski et al., 2019]. ... We use MSMarco v1 passage retrieval dataset, it has total 8.8M passages, 500K training queries and 6980 validation queries. We directly utilize the hard negatives collection [1] from huggingface for training. [1] https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-v3
Dataset Splits	Yes	For in-domain analysis, we use two standard passage retrieval benchmarks: MSMarco Passage Ranking [Bajaj et al., 2018] and Natural Questions (NQ) [Kwiatkowski et al., 2019]. During training, we construct candidate lists for each query by retrieving an initial set of 30 passages using a pretrained sentence transformer model with teacher-forcing (i.e. always adding ground-truth documents). ... MSMarco Passage Ranking [Bajaj et al., 2018]: We use MSMarco v1 passage retrieval dataset, it has total 8.8M passages, 500K training queries and 6980 validation queries. ... Natural Questions (NQ320K) [Kwiatkowski et al., 2019]: We use NQ320K passage retrieval dataset which has 320K passages in the corpus, 300K training queries and 7830 validation queries.
Hardware Specification	Yes	All LLM fine-tuning and inference experiments were conducted using JAX on Google Cloud TPUs (specifically, 8 chip v6e configuration), and reported efficiency metrics correspond to this setup as well.
Software Dependencies	No	The paper mentions "Mistral-7B-v0.3" as the base model and "Adafactor optimizer", "JAX" as the framework. However, it does not provide specific version numbers for JAX or any other libraries used, which would be required for a reproducible description of ancillary software.
Experiment Setup	Yes	Implementation Details. BlockRank and the Full-FT baseline utilize Mistral-7B-v0.3 as the base model. For fine-tuning both models, we employ the Adafactor optimizer [Shazeer and Stern, 2018] with a learning rate of 3 × 10⁻⁷ and a global batch size of 32 (accumulated across replicas). Each model is trained for 1 epoch with a linear warmup followed by cosine decay. For BlockRank, the auxiliary loss weight λ is set to 0.1, and τ is set to 0.05. Unless stated otherwise, BlockRank results employ the proposed attention-based inference. ... B.2 Fine-tuning Details (BlockRank and Full-FT) ... Optimizer: Adafactor [Shazeer and Stern, 2018] with β1 = 0.9. Learning Rate: 3 × 10⁻⁷. Learning Rate Schedule: Linear warmup for 50 steps followed by a cosine decay. Batch Size: A global batch size of 32, accumulated across replicas. Number of Epochs: 1 epoch. Weight Decay: No weight decay. Gradient Clipping: gradient norm clipped to 1.0. Loss for Full-FT: Standard Next Token Prediction (NTP) cross-entropy loss, calculated on the answer tokens (i.e., the ID of the relevant document). B.3 BlockRank Specific Hyperparameters ... Auxiliary Loss Weight (λ): ... set to λ = 0.1 ... InfoNCE Temperature (τ): ... set to τ = 0.05. Signal-Carrying Query Tokens (Tq,signal): ... Tq,signal = [":", "[ "]. Middle Layer for Auxiliary Loss (l): ... set the l = 20 ... Chunk Length (Lchunk): ... set Lchunk = 160 for MSMarco and Lchunk = 384 for NQ
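
Illustrative sketch (not from the paper or its repository): the learning rate schedule quoted in the Experiment Setup row can be written out in a few lines of plain Python. The peak learning rate, the 50 warmup steps, the global batch size of 32 and the roughly 500K MSMarco training queries come from the quoted text. The rest is assumption: one optimizer step per global batch, a warmup that starts at 0, and a cosine decay that ends at 0 on the final step. The function and constant names are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
PEAK_LR = 3e-7            # peak learning rate
WARMUP_STEPS = 50         # linear warmup length
GLOBAL_BATCH = 32         # global batch size
TRAIN_QUERIES = 500_000   # MSMarco training queries ("500K", approximate)

# Assumption: one optimizer step per global batch, one epoch over all queries.
TOTAL_STEPS = math.ceil(TRAIN_QUERIES / GLOBAL_BATCH)


def learning_rate(step, peak=PEAK_LR, warmup=WARMUP_STEPS,
                  total=TOTAL_STEPS, final=0.0):
    """Linear warmup from 0 to `peak`, then cosine decay to `final`.

    Starting at 0 and decaying to `final = 0.0` are assumptions; the
    quoted text does not state the start or end values.
    """
    if step < warmup:
        return peak * step / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return final + 0.5 * (peak - final) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 25, 50, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(f"step {s:>6}: lr = {learning_rate(s):.3e}")
```

Under these assumptions one MSMarco epoch is 15,625 optimizer steps, so the 50-step warmup covers about 0.3% of training. The NQ run would use a different `TOTAL_STEPS`, derived from its own training query count.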