Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards Trustworthy Reranking: A Simple yet Effective Abstention Mechanism

Authors: Hippolyte Gisserot-Boukhlef, Manuel Faysse, Emmanuel Malherbe, Céline Hudelot, Pierre Colombo

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type: Experimental — "We introduce a protocol for evaluating abstention strategies in black-box scenarios (typically encountered when relying on API services), demonstrating their efficacy, and propose a simple yet effective data-driven mechanism. We provide open-source code for experiment replication and abstention implementation, fostering wider adoption and application in diverse contexts."
Researcher Affiliation: Collaboration — Hippolyte Gisserot-Boukhlef (EMAIL): Artefact Research Center; MICS, CentraleSupélec, Université Paris-Saclay. Manuel Faysse (EMAIL): Illuin Technology; MICS, CentraleSupélec, Université Paris-Saclay. Emmanuel Malherbe (EMAIL): Artefact Research Center. Céline Hudelot (EMAIL): MICS, CentraleSupélec, Université Paris-Saclay. Pierre Colombo (EMAIL): Equall.ai; MICS, CentraleSupélec, Université Paris-Saclay.
Pseudocode: No — The paper does not contain any clearly labeled pseudocode or algorithm blocks. Figure 1 presents a procedure diagram, not a pseudocode algorithm.
Open Source Code: Yes — "We release a code package [2] and artifacts [3] to enable full replication of our experiments and the implementation of plug-and-play abstention mechanisms for any use case." Footnote 2: https://github.com/artefactory/abstention-reranker, under MIT license.
Open Datasets: Yes — "We collect six open-source reranking datasets (Lhoest et al., 2021) in three languages, English, French, and Chinese: stackoverflowdupquestions-reranking (Zhang et al., 2015), denoted Stack Overflow in the experiments; askubuntudupquestions-reranking (Lei et al., 2015), denoted Ask Ubuntu; scidocs-reranking (Cohan et al., 2020), denoted SciDocs; mteb-fr-reranking-alloprof-s2p (Lefebvre-Brossard et al., 2023), denoted Alloprof; CMedQAv1-reranking (Zhang et al., 2017), denoted CMedQAv1; and Mmarco-reranking (Bonifacio et al., 2021), denoted Mmarco."
Dataset Splits: Yes — "In this study, in order to guarantee consistent comparisons between abstention mechanisms, we randomly set aside 20% of the initial dataset as a test set, treating the remaining 80% as the reference set."
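The 80/20 reference/test split described above can be sketched with scikit-learn. The array shapes, seed, and variable names below are illustrative assumptions, not values from the paper:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-in for a reranking dataset: one vector of candidate
# relevance scores per query (1000 queries x 10 candidates is assumed).
rng = np.random.default_rng(0)
scores = rng.random((1000, 10))

# Randomly hold out 20% as the test set; the remaining 80% serves as the
# reference set used to fit the abstention mechanism.
reference, test = train_test_split(scores, test_size=0.2, random_state=42)

print(reference.shape, test.shape)  # (800, 10) (200, 10)
```

Fixing the random seed here is one way to make the split reproducible across runs; the paper does not specify how its own split was seeded.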
Hardware Specification: Yes — "We average timed results over 100 runs on a single instance prediction, running the calculations on one Apple M1 CPU (Figure 5). Training compute was obtained on the Jean Zay supercomputer operated by GENCI IDRIS through compute grants 2023-AD011014668R1, AD010614770, as well as on Adastra through projects c1615122, cad15031, cad14770."
Software Dependencies: No — "In this work, we use a linear-regression-based confidence function u_lin (Fisher, 1922). Formally, for a given unsorted vector of relevance scores z, u_lin(z) = β₀ + β₁s₁(z) + … + β_k s_k(z), where β₀, …, β_k ∈ ℝ are the coefficients fitted on S_θ, with an ℓ2 regularization parameter λ = 0.1 (Hoerl & Kennard, 1970)." Footnote 6: Scikit-learn implementation (Pedregosa et al., 2011). Appendix D.1 describes parameters for the random forest (Ho, 1995) and Multi-Layer Perceptron (MLP) (Rumelhart et al., 1986). While these libraries/methods are mentioned, specific version numbers for their implementations (e.g., the Scikit-learn version) are not provided in the text.
Experiment Setup: Yes — "In this work, we use a linear-regression-based confidence function u_lin (Fisher, 1922). Formally, for a given unsorted vector of relevance scores z, u_lin(z) = β₀ + β₁s₁(z) + … + β_k s_k(z), where β₀, …, β_k ∈ ℝ are the coefficients fitted on S_θ, with an ℓ2 regularization parameter λ = 0.1 (Hoerl & Kennard, 1970). Appendix D.1: u_rf is based on a random forest (Ho, 1995) fitted using 100 independent estimators and the squared-error impurity criterion. ... u_mlp is based on a Multi-Layer Perceptron (MLP) (Rumelhart et al., 1986) with one hidden layer of size 128 (retained among several values: 32, 64, 128, 256), ReLU activation, and a mean-squared-error loss function. A learning rate of 0.05 was chosen among multiple values (0.001, 0.005, 0.01, 0.05, 0.1), a batch size equal to that of the reference set was selected (one single iteration per epoch), and 500 training iterations were performed, as this showed the best efficiency-effectiveness trade-off in terms of downstream nAUC."
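The fitted confidence function u_lin above can be sketched as an ℓ2-regularized linear regression in scikit-learn. This is a sketch under assumptions: the features s₁(z), …, s_k(z) are taken to be the sorted relevance scores (suggested by the paper's notation but not confirmed in this excerpt), the ranking-quality target is synthetic, and scikit-learn's `alpha` parameterization of the ℓ2 penalty may differ in scale from the paper's λ = 0.1:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical reference set: one relevance-score vector z per query,
# plus a per-query ranking-quality target (synthetic stand-in here).
rng = np.random.default_rng(0)
Z = rng.random((800, 10))                  # 800 queries, 10 candidates each
quality = Z.max(axis=1) - Z.mean(axis=1)   # illustrative target, not the paper's

# Features s_1(z), ..., s_k(z): the sorted relevance scores (assumed).
# Ridge regression gives the l2-penalized least-squares fit of the
# coefficients beta_0, ..., beta_k on the reference set.
S = np.sort(Z, axis=1)
u_lin = Ridge(alpha=0.1).fit(S, quality)

# Confidence for a new (unsorted) score vector; an abstention mechanism
# would decline to return a ranking when this value falls below a threshold.
z_new = rng.random((1, 10))
confidence = u_lin.predict(np.sort(z_new, axis=1))[0]
print(confidence)
```

Thresholding the predicted confidence is one plausible way to turn this regressor into the plug-and-play abstention mechanism the paper describes; the actual decision rule and target definition are in the paper and released code.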