Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Multi-Objective Hyperparameter Selection via Hypothesis Testing on Reliability Graphs
Authors: Amirmohammad Farzaneh, Osvaldo Simeone
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental evaluations demonstrate that RG-PT significantly outperforms existing methods such as learn-then-test (LTT) and Pareto testing (PT) through a more efficient exploration of the hyperparameter space. |
| Researcher Affiliation | Academia | Amirmohammad Farzaneh, Osvaldo Simeone; Centre for Intelligent Information Processing Systems, Department of Engineering, King's College London, London, United Kingdom; EMAIL |
| Pseudocode | Yes | Algorithm 1 DAGGER [18] ... Algorithm 2 Reliability Graph-Based Pareto Testing (RG-PT) |
| Open Source Code | Yes | The code for the experiments can be found at the anonymous GitHub repository https://anonymous.4open.science/r/RG-PT-EF3A/ |
| Open Datasets | Yes | Sentiment analysis: In this task, based on the Stanford Sentiment Treebank dataset [6]... Sentence similarity: In this task, based on the Semantic Textual Similarity Benchmark dataset [40]... Word in context: In this task, based on the Word-in-Context dataset [41]... WMT16 Romanian-English dataset [46]... MS-COCO dataset [67]... Fashion MNIST dataset. |
| Dataset Splits | Yes | For each task, we use 1000 examples each for the data sets ZOPT and ZMHT, as well as for the test data set. ... The data set sizes are \|Z\| = 400, \|ZOPT\| = 200, and \|ZMHT\| = 200. ... running 200 trials for each algorithm over different splits of calibration data Z into subsets ZOPT and ZMHT with \|ZOPT\| = 1500 and \|ZMHT\| = 1500. ... The calibration data set Z was in turn divided into two groups of size 2,500, for the data sets ZOPT and ZMHT, respectively. |
| Hardware Specification | Yes | All experiments were conducted using dedicated computational resources. Specifically, RG-PT, LTT, and PT runs, along with data generation for the object detection, image classification, and telecommunications engineering tasks, were executed on a machine equipped with an Apple M1 Pro chip (10-core CPU, 16-core GPU, 16 GB RAM). Data generation for the prompt engineering experiment (Section 4.1) and the sequence-to-sequence translation task was performed on an NVIDIA A100 GPU (40 GB VRAM), using CUDA 11.3 and 40 GB system memory. |
| Software Dependencies | No | The paper mentions "CUDA 11.3" but does not provide version numbers for other key software components or libraries (e.g., Python, PyTorch, scikit-learn, Detectron2) that would be needed to replicate the experiments, failing to meet the criteria for listing multiple key components with versions or a self-contained solver with a version. |
| Experiment Setup | Yes | In this experiment, we focus on prompt engineering for the following three tasks from the instruction induction data set [1]: 1. Sentiment analysis: ... we use 1000 examples each for the data sets ZOPT and ZMHT, as well as for the test data set. Furthermore, following the forward generation mode detailed in [42], we use the LLaMA3-70B-Instruct model [43] to generate a set Λ = {λ1, ..., λ100} of distinct instruction-style prompt templates for each task. ... The objective is to find prompts in set Λ that control the average prompt loss R_prompt(λ) = E_Z[r_prompt(Z, λ)] below a target level of α = 0.2, while minimizing the average prompt length. For this selection, we wish to control the FDR in (8) at level δ = 0.1. ... We set the pseudocount n_p to 1,000. ... Table 1: RG-PT parameter settings for each experiment. Experiment: Prompt Engineering; D = 17, n_p = 1000, τ = 0.1. |
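
The setup quoted in the table (a calibration set split into ZOPT and ZMHT, a risk target α = 0.2, and FDR control at level δ = 0.1) can be illustrated with a minimal Python sketch. This is not the paper's RG-PT procedure, which tests hypotheses along a reliability graph. It is a generic stand-in that assumes losses bounded in [0, 1], uses a Hoeffding p-value per candidate, and applies the Benjamini-Hochberg step-up rule, which is itself valid only under independence or positive dependence of the p-values. The function names `split_calibration`, `hoeffding_p_value`, and `benjamini_hochberg` are hypothetical and chosen for illustration.

```python
import math
import random


def split_calibration(data, n_opt, seed=0):
    """Shuffle the calibration data and split it into (Z_OPT, Z_MHT).

    Hypothetical helper: the first n_opt shuffled items form Z_OPT,
    the remainder forms Z_MHT.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    return shuffled[:n_opt], shuffled[n_opt:]


def hoeffding_p_value(losses, alpha):
    """P-value for the null hypothesis R(lambda) > alpha.

    Assumes every loss lies in [0, 1], so Hoeffding's inequality applies.
    """
    n = len(losses)
    empirical_risk = sum(losses) / n
    gap = max(alpha - empirical_risk, 0.0)
    return math.exp(-2.0 * n * gap * gap)


def benjamini_hochberg(p_values, delta):
    """Return the sorted indices rejected by the BH step-up rule at level delta."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value falls under its BH threshold
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * delta / m:
            k = rank
    return sorted(order[:k])


if __name__ == "__main__":
    # Synthetic example: 5 hypothetical candidates with Bernoulli losses on Z_MHT.
    rng = random.Random(1)
    true_risks = [0.05, 0.10, 0.18, 0.25, 0.40]
    _, z_mht = split_calibration(range(400), n_opt=200)
    p_vals = [
        hoeffding_p_value([float(rng.random() < r) for _ in z_mht], alpha=0.2)
        for r in true_risks
    ]
    print("p-values:", p_vals)
    print("candidates deemed reliable:", benjamini_hochberg(p_vals, delta=0.1))
```

In this sketch only ZMHT is used for testing. In the quoted setup, ZOPT is reserved for the optimization step that precedes testing, which the sketch leaves out.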