Preference Alignment with Flow Matching

Authors: Minu Kim, Yongsik Lee, Sehyeok Kang, Jihwan Oh, Song Chong, Se-Young Yun

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results indicate the practical effectiveness of our method, offering a new direction in aligning a pre-trained model to preference.
Researcher Affiliation | Academia | Minu Kim, Yongsik Lee, Sehyeok Kang, Jihwan Oh, Song Chong, Se-Young Yun (KAIST AI); {minu.kim, dldydtlr93, kangsehyeok0329, ericoh929, songchong, yunseyoung}@kaist.ac.kr
Pseudocode | Yes | Detailed algorithm can be found in Algorithm 1. ... Algorithm 1: PFM: Preference Flow Matching (see the training-step sketch after the table)
Open Source Code | Yes | Our code is available at https://github.com/jadehaus/preference-flow-matching.
Open Datasets | Yes | We first evaluate PFM on a conditional image generation task using the MNIST dataset [LeCun et al., 1998]. ... We train a preference flow on randomly selected pairs of movie reviews y+, y− from the IMDB dataset [Maas et al., 2011]. ... we employ the D4RL [Fu et al., 2020] benchmark to assess the performance of PFM in reinforcement learning tasks.
Dataset Splits | No | The paper does not explicitly provide details about validation dataset splits or a specific validation methodology.
Hardware Specification | Yes | All experiments were conducted on a single Nvidia Titan RTX GPU and a single i9-10850K CPU core for each run.
Software Dependencies | No | The paper mentions software components such as DCGAN, LeNet, a T5-based autoencoder, a GPT-2 SFT model, PPO, and behavior cloning, and implicitly relies on common frameworks such as PyTorch, but it does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We utilize a pre-trained DCGAN [Radford et al., 2015] generator as πref and collect sample pairs from πref(·|x) conditioned on the digit labels x ∈ {0, ..., 9}. To construct preference datasets, we assign preferences to sample pairs according to the softmax probabilities of the labels from a LeNet [LeCun et al., 1998]. ... we adopt the pre-trained sentiment classifier as the preference annotator. ... For our PFM framework to be applied to variable-length inputs, we employ a T5-based autoencoder to work with fixed-sized embeddings. ... we search KL regularization coefficient β from 0.01 to 100 and adopt the best one. ... The preference datasets consist of 1,000 pairs of preferred and rejected segments and their context for each offline dataset, with the segment length 10. (see the pair-annotation sketch after the table)
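
The Experiment Setup row describes building preference data by sampling pairs from a pre-trained DCGAN generator and ranking them with a LeNet classifier's softmax probabilities. The sketch below is a minimal, hypothetical PyTorch rendering of that annotation step; the function name annotate_pair, the call signatures generator(z, y) and classifier(samples), and the deterministic argmax ranking are assumptions, not the paper's code (the paper's wording is also consistent with sampling preferences Bradley-Terry style from the softmax probabilities).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def annotate_pair(generator, classifier, label, latent_dim=100, device="cpu"):
    """Hypothetical sketch: draw two samples from pi_ref(.|label) and rank them
    by the classifier's softmax probability of `label` (names and signatures
    are placeholders, not the paper's implementation)."""
    z = torch.randn(2, latent_dim, device=device)
    y = torch.full((2,), label, dtype=torch.long, device=device)
    samples = generator(z, y)                               # two candidates for the same digit
    probs = F.softmax(classifier(samples), dim=-1)[:, label]
    winner = int(probs.argmax())                            # higher class probability => preferred
    return samples[winner], samples[1 - winner]             # (y_plus, y_minus)
```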
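The Pseudocode row refers to Algorithm 1 (PFM: Preference Flow Matching). As a rough illustration only, the sketch below writes one training step as a standard conditional flow-matching loss that transports rejected samples y− toward preferred samples y+ along a linear interpolation path; the choice of path, the vector-field signature v_theta(y_t, t), and the helper name pfm_loss are assumptions and may differ from the paper's Algorithm 1.

```python
import torch
import torch.nn as nn

def pfm_loss(v_theta: nn.Module, y_plus: torch.Tensor, y_minus: torch.Tensor) -> torch.Tensor:
    """Hypothetical conditional flow-matching loss on a batch of preference pairs
    (y_minus -> y_plus), assuming a linear (rectified-flow style) path."""
    b = y_plus.size(0)
    t = torch.rand(b, device=y_plus.device)                 # random time in [0, 1] per sample
    t_exp = t.view(b, *([1] * (y_plus.dim() - 1)))          # broadcast t over sample dimensions
    y_t = (1.0 - t_exp) * y_minus + t_exp * y_plus          # point on the straight path at time t
    target = y_plus - y_minus                               # constant velocity of the linear path
    pred = v_theta(y_t, t)                                  # predicted velocity field (assumed signature)
    return ((pred - target) ** 2).mean()
```

At inference time, a sample from the reference model would then be refined by integrating dy/dt = v_theta(y, t) from t = 0 to t = 1 (for example with a few Euler steps); the solver and step count used by the paper are not stated in the excerpts above.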