Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Out-of-Distribution Detection and Selective Generation for Conditional Language Models
Authors: Jie Ren, Jiaming Luo, Yao Zhao, Kundan Krishna, Mohammad Saleh, Balaji Lakshminarayanan, Peter J Liu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a highly accurate and lightweight OOD detection method for CLMs, and demonstrate its effectiveness on abstractive summarization and translation. |
| Researcher Affiliation | Collaboration | 1Google Research 2Carnegie Mellon University, work done while at Google Research |
| Pseudocode | Yes | Algorithm 1 Fitting Gaussians for input and output embeddings |
| Open Source Code | No | Not found. The paper does not contain an explicit statement about releasing the source code for the described methodology or a direct link to a code repository. |
| Open Datasets | Yes | We fine-tuned PEGASUS_LARGE (Zhang et al., 2020) on the xsum (Narayan et al., 2018) dataset, consisting of BBC News articles with short, abstractive summaries. |
| Dataset Splits | No | Not found. The paper describes training and testing datasets, but does not explicitly state the use of validation splits or their sizes/strategies for reproducibility. |
| Hardware Specification | No | Not found. The paper does not specify the hardware used for running the experiments (e.g., GPU models, CPU types, or cloud computing instances). |
| Software Dependencies | No | Not found. The paper mentions the use of an 'Adafactor optimizer' but does not provide specific software names with version numbers for reproducibility (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The model is trained with Adafactor optimizer (Shazeer & Stern, 2018) for 2M steps with 0.1 dropout and 1024 batch size. Decoding is done using beam search with 10 beam size and α = 0.6 length normalization (Wu et al., 2016b). |
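
The Experiment Setup entry quotes a length normalization of α = 0.6 for beam search, citing Wu et al. (2016b). As a rough illustration of what that setting does, the sketch below assumes the commonly used length penalty from that line of work, lp(Y) = ((5 + |Y|) / 6)^α, and divides a hypothesis's summed log-probability by it. The function names and the toy log-probabilities are hypothetical and not taken from the paper, and the coverage penalty from Wu et al. is omitted.

```python
# Minimal sketch of length-normalized beam scoring.
# Assumption: lp(Y) = ((5 + |Y|) / 6) ** alpha, the form popularized by Wu et al. (2016).
# All names and numbers here are illustrative, not the paper's implementation.

def length_penalty(length, alpha=0.6):
    """Return the length penalty for a hypothesis with `length` tokens."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(token_logprobs, alpha=0.6):
    """Sum the token log-probabilities and divide by the length penalty."""
    return sum(token_logprobs) / length_penalty(len(token_logprobs), alpha)


def rerank(hypotheses, alpha=0.6):
    """Sort hypotheses (lists of token log-probs) from best to worst score."""
    return sorted(hypotheses, key=lambda h: normalized_score(h, alpha), reverse=True)


# Toy example: a short hypothesis and a longer one with better per-token scores.
short_hyp = [-1.0, -1.0]   # raw sum: -2.0
long_hyp = [-0.35] * 7     # raw sum: about -2.45

best_without_norm = rerank([short_hyp, long_hyp], alpha=0.0)[0]
best_with_norm = rerank([short_hyp, long_hyp], alpha=0.6)[0]
```

With α = 0 the penalty is 1 for every length, so the raw sum decides and the shorter hypothesis ranks first. With α = 0.6 the longer hypothesis ranks first in this toy case, which is the bias toward short outputs that length normalization is meant to counteract.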