Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Selective Generation for Controllable Language Models
Authors: Minjae Lee, Kyungmin Kim, Taesoo Kim, Sangdon Park
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we demonstrate the efficacy of the SGen family in achieving a desired FDR-E level with comparable selection efficiency to those from baselines on both open and closed source GLMs. |
| Researcher Affiliation | Academia | Minjae Lee GSAI POSTECH EMAIL Kyungmin Kim GSAI POSTECH EMAIL Taesoo Kim SCS & SCP Ga Tech EMAIL Sangdon Park GSAI & CSE POSTECH EMAIL |
| Pseudocode | Yes | Algorithm 1 Entailment Set Learning with a False Entailment Rate (FER) Guarantee |
| Open Source Code | Yes | Code and datasets are provided at https://github.com/ml-postech/selective-generation. |
| Open Datasets | Yes | We use two GLMs, GPT-3.5-Turbo and Alpaca-7B, alongside the Natural Questions (NQ) dataset to annotate entailment labels for question-answer pairs. [...] we create a dataset on textual entailment using the Natural Questions (NQ) dataset [17] for each GLM. |
| Dataset Splits | Yes | Approximately 7.3k (7,374) and 4.6k (4,595) samples are labeled for Alpaca-7B and GPT-3.5-Turbo, respectively, and both are split into calibration and test data at an 8:2 ratio. |
| Hardware Specification | Yes | Our system environment consists of 4 NVIDIA A100 80GB with 128 CPUs. |
| Software Dependencies | No | The paper mentions models like 'GPT-3.5-Turbo and Alpaca-7B' and 'deberta-v2-xxlarge-mnli' but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | To control an FDR-E, we use two user-specified parameters (ε, δ), where we use (0.25, 0.02) unless specified. For our methods (i.e., SGen Semi, SGen Semi No MS, and SGen Semi-Sup No MS), we have five control parameters (εS, δS, δE, δW), which we map as follows: εS = ε, δS = (δ − δW)/2, δE = (δ − δW)/2, δW = 10⁻⁵. |
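
The parameter mapping quoted in the Experiment Setup row can be sketched in a few lines of Python. This is only an illustration, not code from the authors' repository: it assumes the mapping reads δS = δE = (δ − δW)/2 with δW = 10⁻⁵, so that the three confidence terms sum to the user-specified δ. The names `SGenParams` and `map_parameters` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SGenParams:
    """Hypothetical container for the control parameters named in the table."""
    eps_s: float    # εS
    delta_s: float  # δS
    delta_e: float  # δE
    delta_w: float  # δW


def map_parameters(eps: float, delta: float, delta_w: float = 1e-5) -> SGenParams:
    """Map user-specified (ε, δ) to control parameters.

    Assumed reading of the quoted setup:
    εS = ε, δS = δE = (δ − δW)/2, δW = 1e-5.
    """
    if not 0.0 < delta_w < delta < 1.0:
        raise ValueError("expected 0 < delta_w < delta < 1")
    half = (delta - delta_w) / 2
    return SGenParams(eps_s=eps, delta_s=half, delta_e=half, delta_w=delta_w)


# Default setting reported in the table: (ε, δ) = (0.25, 0.02).
params = map_parameters(0.25, 0.02)
print(params)
```

Under this reading, δS + δE + δW equals δ, which is what one would expect if the overall confidence budget is split across the three terms.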