Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Learning to Reject with a Fixed Predictor: Application to Decontextualization
Authors: Christopher Mohri, Daniel Andor, Eunsol Choi, Michael Collins, Anqi Mao, Yutao Zhong
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For evaluation, we choose the decontextualization task, and provide a manually-labelled dataset of 2,000 examples. Our algorithm significantly outperforms the baselines considered, with a 25% improvement in coverage when halving the error rate, which is only 3% away from the theoretical limit. |
| Researcher Affiliation | Collaboration | Christopher Mohri¹, Daniel Andor², Eunsol Choi³, Michael Collins², Anqi Mao⁴, Yutao Zhong⁴; ¹Stanford University, ²Google, ³The University of Texas at Austin, ⁴Courant Institute |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. Methods are described in prose. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for their methodology or a link to a code repository. |
| Open Datasets | Yes | For our experiments, we labeled 2,000 decontextualizations of a fixed MT5 XXL model (Xue et al., 2020) ourselves... We randomly split our 2,000 annotation examples into 1,500 train/500 validation examples and perform 4-fold cross-validation... We provide additional empirical evaluation on two simpler image classification datasets: Fashion-MNIST (Xiao et al., 2017) and KMNIST (Clanuwat et al., 2018). |
| Dataset Splits | Yes | We randomly split our 2,000 annotation examples into 1,500 train/500 validation examples and perform 4-fold cross-validation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU specifications, or memory amounts. |
| Software Dependencies | Yes | We further fine-tune a T5X 1.1 XXL decontextualization model (Roberts et al., 2022)... |
| Experiment Setup | Yes | We perform a hyper-parameter search over {1e-4, 1e-3, 1e-2} for the learning rate, and {0, 0.05, ..., 0.2} for the dropout rate. |
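
The Dataset Splits and Experiment Setup rows can be made concrete with a short sketch. The Python below is illustrative only and is not the authors' code. It assumes the dropout grid steps by 0.05 (giving five values), and that "4-fold cross-validation" means four disjoint validation folds of 500 examples, each paired with the remaining 1,500 for training. The function names `hyperparameter_grid` and `four_fold_splits`, and the `seed` parameter, are hypothetical.

```python
import itertools
import random

# Values quoted in the Experiment Setup row.
LEARNING_RATES = [1e-4, 1e-3, 1e-2]
# Assumption: the elided grid {0, 0.05, ..., 0.2} steps by 0.05.
DROPOUT_RATES = [round(0.05 * i, 2) for i in range(5)]


def hyperparameter_grid():
    """Return every (learning rate, dropout rate) combination as a dict."""
    return [
        {"learning_rate": lr, "dropout_rate": dr}
        for lr, dr in itertools.product(LEARNING_RATES, DROPOUT_RATES)
    ]


def four_fold_splits(n_examples=2000, n_folds=4, seed=0):
    """Shuffle example indices and yield (train, validation) index lists.

    Assumption: each fold serves once as the 500-example validation set,
    with the other three folds (1,500 examples) used for training.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    fold_size = n_examples // n_folds
    folds = [
        indices[k * fold_size:(k + 1) * fold_size] for k in range(n_folds)
    ]
    splits = []
    for k in range(n_folds):
        validation = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, validation))
    return splits


if __name__ == "__main__":
    print(len(hyperparameter_grid()), "configurations")
    for train, validation in four_fold_splits():
        print(len(train), "train /", len(validation), "validation")
```

Under these assumptions the search covers 15 configurations, each of which would be evaluated on four 1,500/500 splits. The summary does not state how the best configuration was selected across folds, so that step is left out.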