Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Weak-to-Strong Generalization under Distribution Shifts

Authors: Myeongho Jeon, Jan Sobotka, Suhwan Choi, Maria Brbic

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We evaluate RAVEN on image classification, text classification, and preference alignment in text generation tasks. RAVEN achieves a 55% improvement in image classification, a 57% improvement in text classification, and a 33% improvement in preference alignment compared to the best alternative baselines for each task. |
| Researcher Affiliation | Academia | Myeongho Jeon¹, Jan Sobotka¹, Suhwan Choi², Maria Brbić¹; ¹EPFL, ²Seoul National University |
| Pseudocode | Yes | Algorithm 1 Robust Adaptive Weighting (RAVEN) |
| Open Source Code | Yes | Project website with code: https://brbiclab.epfl.ch/projects/raven |
| Open Datasets | Yes | We evaluate RAVEN on image classification, text classification, and preference alignment in text generation tasks. For OOD setting, we use IWILDCAM [5], CAMELYON17 [39], and FMOW [18] as benchmarks to evaluate our framework. In these datasets, the domain is defined by the location of the camera, the hospital, and the time, respectively. For the InD scenario, we adopt the same approach as [10], which utilized IMAGENET. ... Text classification. We employ AMAZON-WILDS [35], MEDMCQA [51], and MEDQA [31] to evaluate RAVEN for text classification. ... Preference alignment. ... HH-RLHF [4], OPENAI SUMMARIZE FROM FEEDBACK [61], and HUMAN-LIKE DPO [11] datasets. |
| Dataset Splits | Yes | For each dataset, the training set is used as the source data P_src, 70% of the OOD validation set is randomly selected as fine-tuning data P_tuning, 10% is reserved as the validation set for hyperparameter tuning, and the remaining 20% is designated as the target data P_trg. ... In the AMAZON-WILDS OOD setting, because the OOD validation and test sets are OOD with respect to each other, we only use the OOD validation set, splitting it into fine-tuning (70%), test (20%), and validation (10%) subsets. ... The source data is split into training (90%) and validation (10%) subsets. |
| Hardware Specification | Yes | Our image classification experiments used a cluster of 8 NVIDIA GeForce RTX 3090 GPUs, but each individual run required only a single GPU and less than 22 GB of VRAM. ... Our text-classification and DPO experiments used a cluster of 8 NVIDIA H100 GPUs, and each individual run employed the 8 GPUs in parallel. |
| Software Dependencies | No | The paper mentions specific optimizers (Adam, AdamW, SGD) and refers to the PyTorch library [52], along with specific models (Qwen2.5-0.5B, Llama-3.2-1B) and frameworks (LoRA [27]), but does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | For each experiment, we perform a grid search to select the learning rate (including its initial value and decay schedule) and the number of training iterations based on validation loss. To determine the duration of the easy-sample guided initialization phase, we conduct a grid search over {10%, 20%, 50%} of the total iterations. ... Table 9: Hyperparameter configurations for RAVEN. ... Table 10: Hyperparameter configurations for training the weak models. |
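
The 70% / 10% / 20% partition quoted under Dataset Splits can be illustrated with a minimal sketch. This is not the authors' code: the function name `split_ood_validation`, its parameters, and the fixed seed are hypothetical, and the sketch only assumes that the OOD validation set is shuffled once and cut into three disjoint subsets (fine-tuning, validation, target).

```python
import random


def split_ood_validation(n_examples, seed=0, frac_tuning=0.7, frac_val=0.1):
    """Randomly partition example indices into three disjoint subsets.

    Hypothetical helper: returns (tuning, validation, target) index lists,
    where target receives whatever remains after the first two cuts
    (20% with the default fractions).
    """
    indices = list(range(n_examples))
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)

    n_tuning = round(frac_tuning * n_examples)
    n_val = round(frac_val * n_examples)

    tuning = indices[:n_tuning]
    validation = indices[n_tuning:n_tuning + n_val]
    target = indices[n_tuning + n_val:]
    return tuning, validation, target


tuning, validation, target = split_ood_validation(1000)
print(len(tuning), len(validation), len(target))
```

Assigning the remainder to the target subset, rather than computing a third rounded count, guarantees that every example lands in exactly one subset even when the set size is not a multiple of ten.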