Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Towards Trustworthy Reranking: A Simple yet Effective Abstention Mechanism

Authors: Hippolyte Gisserot-Boukhlef, Manuel Faysse, Emmanuel Malherbe, Céline Hudelot, Pierre Colombo

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We introduce a protocol for evaluating abstention strategies in black-box scenarios (typically encountered when relying on API services), demonstrating their efficacy, and propose a simple yet effective data-driven mechanism. We provide open-source code for experiment replication and abstention implementation, fostering wider adoption and application in diverse contexts.
Researcher Affiliation Collaboration Hippolyte Gisserot-Boukhlef (Artefact Research Center; MICS, CentraleSupélec, Université Paris-Saclay); Manuel Faysse (Illuin Technology; MICS, CentraleSupélec, Université Paris-Saclay); Emmanuel Malherbe (Artefact Research Center); Céline Hudelot (MICS, CentraleSupélec, Université Paris-Saclay); Pierre Colombo (Equall.ai; MICS, CentraleSupélec, Université Paris-Saclay)
Pseudocode No The paper does not contain any clearly labeled pseudocode or algorithm blocks. Figure 1 presents a procedure diagram, not a pseudocode algorithm.
Open Source Code Yes We release a code package [2] and artifacts [3] to enable full replication of our experiments and the implementation of plug-and-play abstention mechanisms for any use case. Footnote 2: https://github.com/artefactory/abstention-reranker, under MIT license.
Open Datasets Yes We collect six open-source reranking datasets (Lhoest et al., 2021) in three different languages, English, French and Chinese: stackoverflowdupquestions-reranking (Zhang et al., 2015), denoted Stack Overflow in the experiments, askubuntudupquestions-reranking (Lei et al., 2015), denoted Ask Ubuntu, scidocs-reranking (Cohan et al., 2020), denoted SciDocs, mteb-fr-reranking-alloprof-s2p (Lefebvre-Brossard et al., 2023), denoted Alloprof, CMedQAv1-reranking (Zhang et al., 2017), denoted CMedQAv1, and Mmarco-reranking (Bonifacio et al., 2021), denoted Mmarco.
Dataset Splits Yes In this study, in order to guarantee consistent comparisons between abstention mechanisms, we randomly set aside 20% of the initial dataset as a test set, treating the remaining 80% as the reference set.
Hardware Specification Yes We average timed results over 100 runs on a single instance prediction, running the calculations on one Apple M1 CPU (Figure 5). Training compute was obtained on the Jean Zay supercomputer operated by GENCI IDRIS through compute grant 2023-AD011014668R1, AD010614770 as well as on Adastra through project c1615122, cad15031, cad14770.
Software Dependencies No In this work, we use a linear-regression-based confidence function $u_{\text{lin}}$ (Fisher, 1922). Formally, for a given unsorted vector of relevance scores $z$, $u_{\text{lin}}(z) = \beta_0 + \beta_1 s_1(z) + \dots + \beta_k s_k(z)$, where $\beta_0, \dots, \beta_k \in \mathbb{R}$ are the coefficients fitted on $S_\theta$, with an $\ell_2$ regularization parameter $\lambda = 0.1$ (Hoerl & Kennard, 1970). Footnote 6: Scikit-learn implementation (Pedregosa et al., 2011). Appendix D.1 describes parameters for random forest (Ho, 1995) and Multi-Layer Perceptron (MLP) (Rumelhart et al., 1986). While these libraries/methods are mentioned, specific version numbers for their implementations (e.g., Scikit-learn version) are not provided in the text.
Experiment Setup Yes In this work, we use a linear-regression-based confidence function $u_{\text{lin}}$ (Fisher, 1922). Formally, for a given unsorted vector of relevance scores $z$, $u_{\text{lin}}(z) = \beta_0 + \beta_1 s_1(z) + \dots + \beta_k s_k(z)$, where $\beta_0, \dots, \beta_k \in \mathbb{R}$ are the coefficients fitted on $S_\theta$, with an $\ell_2$ regularization parameter $\lambda = 0.1$ (Hoerl & Kennard, 1970). Appendix D.1: $u_{\text{rf}}$ is based on a random forest (Ho, 1995) fitted using 100 independent estimators and squared error impurity criterion. ... $u_{\text{mlp}}$ is based on a Multi-Layer Perceptron (MLP) (Rumelhart et al., 1986) with one hidden layer of size 128 (retained among several values: 32, 64, 128, 256), ReLU activation, and mean squared error loss function. 0.05 was chosen among multiple learning rate values (0.001, 0.005, 0.01, 0.05, 0.1), a batch size equal to that of the reference set was selected (one single iteration per epoch), and 500 training iterations were performed, as it showed the best efficiency-effectiveness trade-off in terms of downstream nAUC.
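
The setup quoted in the table (80%/20% reference/test split, a linear confidence function over sorted relevance scores, l2 regularization with λ = 0.1) can be illustrated with a minimal sketch. This is not the authors' code: the paper reports using scikit-learn, whereas the sketch below swaps in a closed-form NumPy ridge fit with an unpenalized intercept, which minimizes the same objective as scikit-learn's `Ridge` with `alpha=0.1`. The synthetic scores, the toy target `y`, the decreasing sort order, and the quantile-based abstention threshold are all assumptions made for illustration.

```python
import numpy as np


def sort_scores(z):
    """Sort each row of relevance scores in decreasing order (assumed convention)."""
    z = np.atleast_2d(np.asarray(z, dtype=float))
    return -np.sort(-z, axis=1)


def fit_ridge(s, y, lam=0.1):
    """Closed-form ridge fit; the intercept beta_0 is not penalized."""
    s_mean, y_mean = s.mean(axis=0), y.mean()
    sc, yc = s - s_mean, y - y_mean
    k = s.shape[1]
    beta = np.linalg.solve(sc.T @ sc + lam * np.eye(k), sc.T @ yc)
    beta0 = y_mean - s_mean @ beta
    return beta0, beta


def u_lin(z, beta0, beta):
    """Linear confidence: beta_0 + beta_1 * s_1(z) + ... + beta_k * s_k(z)."""
    return beta0 + sort_scores(z) @ beta


# Synthetic stand-in data (assumption): n instances with k candidate scores each.
rng = np.random.default_rng(0)
n, k = 500, 10
z = rng.normal(size=(n, k))
s = sort_scores(z)
# Toy per-instance quality target (assumption): gap between the top two scores plus noise.
y = s[:, 0] - s[:, 1] + 0.1 * rng.normal(size=n)

# Randomly set aside 20% as the test set; the remaining 80% is the reference set.
perm = rng.permutation(n)
n_test = n // 5
test_idx, ref_idx = perm[:n_test], perm[n_test:]

beta0, beta = fit_ridge(s[ref_idx], y[ref_idx], lam=0.1)
conf = u_lin(z[test_idx], beta0, beta)

# Hypothetical abstention rule: abstain when confidence falls below the
# 20th percentile of reference-set confidences.
tau = np.quantile(u_lin(z[ref_idx], beta0, beta), 0.2)
abstain = conf < tau
print(f"abstained on {abstain.sum()} of {n_test} test instances")
```

Fitting on the reference set only and scoring the held-out 20% mirrors the split described in the Dataset Splits row; the threshold `tau` would in practice be chosen to match a target abstention rate.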