Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis

Authors: Leitian Tao, Xuefeng Du, Sharon Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, our latent-space synthesis significantly outperforms text-based augmentation on standard benchmarks, achieving superior results while being 18× faster in generation and using a 16,000× smaller model. We demonstrate strong empirical results across reward modeling benchmarks, achieving superior performance to text-based augmentation while being 18× faster and 16,000× smaller in model size, with detailed ablations validating design choices. In this section, we evaluate our latent-based synthesis approach for reward modeling, comparing LENS against baseline approaches across different scales and sample sizes, followed by ablation studies to analyze the impact of various components of our methodology.
Researcher Affiliation Academia Leitian Tao¹, Xuefeng Du², Sharon Li¹. ¹Department of Computer Sciences, University of Wisconsin-Madison; ²College of Computing and Data Science, Nanyang Technological University.
Pseudocode No The paper describes its methodology in Section 3, breaking it down into stages, but it does not present any of these stages or other procedures in a structured pseudocode or algorithm block.
Open Source Code Yes Code is publicly available at https://github.com/deeplearning-wisc/lens.
Open Datasets Yes We use two preference datasets: (1) HH-RLHF [25], which contains human preference pairs focused on helpfulness and harmlessness; and (2) the TL;DR summarization [26], consisting of preference pairs for Reddit post summarization.
Dataset Splits Yes To explore the effectiveness of using synthesis to extend the training dataset in a sample-limited scenario, we subsample 1,000 samples as seed samples. In our ablations, we extensively verify different LLM backbones and different numbers of seed samples. For each prompt in the test set, we generate n = 16 candidate responses from the base model. We train our VAE model on several subsets of the original HH-RLHF preference embeddings, with these subsets having varying sizes N ∈ {100, 500, 1000, 2000, 5000, 10000, 50000, 100000}.
Hardware Specification Yes All experiments were conducted on NVIDIA A100 GPUs. This translates to a 13× speedup in total processing time (from 5.2 hours to 0.4 hours on a single A100 GPU).
Software Dependencies No The paper mentions using specific base LLM models like Llama-3.1-8B-Instruct, Skywork, and GPT-4, and refers to concepts like VAEs and MLPs. However, it does not provide specific version numbers for any ancillary software libraries, programming languages (e.g., Python version), or deep learning frameworks (e.g., PyTorch, TensorFlow versions) used for implementation.
Experiment Setup Yes Our Variational Autoencoder (VAE) utilized a 2-layer MLP for both its encoder and decoder, with hidden dimensions of 64 and a latent dimension of 16. The VAE was trained for 100 epochs using the Adam optimizer with a learning rate of 1e-4 and a batch size of 128. The divergence loss weight γ (see Section 3.1) was set to 0.1. For latent space synthesis (see Section 3.2), we applied perturbations using a noise variance of σ²_noise = 0.01. The embedding-based reward model, a two-layer MLP with a hidden dimension of 256, was trained with a learning rate of 1e-4 for up to 20 epochs, employing an early stopping mechanism with a patience of 5 epochs. Specifically, we trained for 1 epoch with a learning rate of 1e-5, a batch size of 32, 1 gradient accumulation step, and a maximum sequence length of 512 tokens. We employed DeepSpeed ZeRO stage 2 and performed full fine-tuning.
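
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is not the authors' code (see the repository linked above for that); the class names, field names, and both helper functions are hypothetical. Two points are assumptions rather than statements from the quoted text: that the latent perturbation is isotropic Gaussian noise, and that early stopping triggers once the best validation loss is at least `patience` epochs old.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class VAEConfig:
    # Values quoted in the Experiment Setup row; field names are hypothetical.
    hidden_dim: int = 64
    latent_dim: int = 16
    epochs: int = 100
    learning_rate: float = 1e-4
    batch_size: int = 128
    divergence_weight: float = 0.1  # gamma, the divergence loss weight
    noise_variance: float = 0.01    # sigma^2_noise for latent perturbation


@dataclass(frozen=True)
class RewardModelConfig:
    # Embedding-based reward model (two-layer MLP); field names are hypothetical.
    hidden_dim: int = 256
    learning_rate: float = 1e-4
    max_epochs: int = 20
    patience: int = 5


def perturb_latent(z, noise_variance, rng):
    """Add zero-mean Gaussian noise to each latent coordinate.

    Assumption: the noise is isotropic Gaussian. The quoted text only
    gives the variance, so the distribution is illustrative.
    """
    std = math.sqrt(noise_variance)  # variance 0.01 -> standard deviation 0.1
    return [z_i + rng.gauss(0.0, std) for z_i in z]


def should_stop(val_losses, patience):
    """Return True once the best validation loss is `patience` or more epochs old.

    Assumption: this is one common early-stopping rule, not necessarily
    the one used in the paper.
    """
    if not val_losses:
        return False
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_epoch >= patience
```

Note that a variance of 0.01 corresponds to a standard deviation of 0.1, which is the value a Gaussian sampler such as `random.gauss` expects.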