Preference Alignment with Flow Matching
Authors: Minu Kim, Yongsik Lee, Sehyeok Kang, Jihwan Oh, Song Chong, Se-Young Yun
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results indicate the practical effectiveness of our method, offering a new direction in aligning a pre-trained model to preference. |
| Researcher Affiliation | Academia | Minu Kim¹, Yongsik Lee¹, Sehyeok Kang¹, Jihwan Oh¹, Song Chong¹, Se-Young Yun¹; ¹KAIST AI. {minu.kim, dldydtlr93, kangsehyeok0329, ericoh929, songchong, yunseyoung}@kaist.ac.kr |
| Pseudocode | Yes | Detailed algorithm can be found in Algorithm 1. ... Algorithm 1: PFM: Preference Flow Matching |
| Open Source Code | Yes | Our code is available at https://github.com/jadehaus/preference-flow-matching. |
| Open Datasets | Yes | We first evaluate PFM on a conditional image generation task using the MNIST dataset [LeCun et al., 1998]. ... We train a preference flow on randomly selected pairs of movie reviews y+, y− from the IMDB dataset [Maas et al., 2011]. ... we employ the D4RL [Fu et al., 2020] benchmark to assess the performance of PFM in reinforcement learning tasks. |
| Dataset Splits | No | The paper does not explicitly provide details about validation dataset splits or a specific validation methodology. |
| Hardware Specification | Yes | All experiments were conducted on a single Nvidia Titan RTX GPU and a single i9-10850K CPU core for each run. |
| Software Dependencies | No | The paper mentions software components such as DCGAN, LeNet, a T5-based autoencoder, a GPT-2 SFT model, PPO, and behavior cloning, and implicitly relies on common frameworks such as PyTorch, but it does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We utilize a pre-trained DCGAN [Radford et al., 2015] generator as πref and collect sample pairs from πref(·|x) conditioned on the digit labels x ∈ {0, ..., 9}. To construct preference datasets, we assign preferences to sample pairs according to the softmax probabilities of the labels from a LeNet [LeCun et al., 1998]. ... we adopt the pre-trained sentiment classifier as the preference annotator. ... For our PFM framework to be applied to variable-length inputs, we employ a T5-based autoencoder to work with fixed-sized embeddings. ... we search KL regularization coefficient β from 0.01 to 100 and adopt the best one. ... The preference datasets consist of 1,000 pairs of preferred and rejected segments and their context for each offline dataset, with the segment length 10. |
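
The pair-construction step quoted in the Experiment Setup row can be illustrated with a short sketch. The names below (`make_preference_pairs`, `ref_generator`, `classifier`, and their call signatures) are hypothetical and chosen only for illustration; the sketch assumes a pre-trained conditional generator standing in for πref and a classifier whose softmax probabilities act as the preference annotator, as the quoted setup describes. Whether the winner is picked deterministically or sampled in proportion to the scores is not specified in the quote, so the argmax choice here is an assumption.

```python
import torch

def make_preference_pairs(ref_generator, classifier, digit, n_pairs, z_dim=100):
    """Hypothetical sketch: sample pairs from a reference generator and
    annotate preference via the classifier's softmax probability of `digit`."""
    preferred, rejected = [], []
    for _ in range(n_pairs):
        # Draw two candidate samples from the reference policy pi_ref(. | x = digit).
        z = torch.randn(2, z_dim)
        y = ref_generator(z, labels=torch.tensor([digit, digit]))  # assumed signature
        # Score each candidate by the softmax probability of the conditioning digit.
        probs = torch.softmax(classifier(y), dim=-1)[:, digit]
        win = int(probs.argmax())
        preferred.append(y[win])
        rejected.append(y[1 - win])
    return torch.stack(preferred), torch.stack(rejected)
```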
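Algorithm 1 referenced in the Pseudocode row is not reproduced in the quoted text, so the following is only a minimal sketch of what one training step of a "preference flow" could look like under standard conditional flow matching: interpolate between a rejected sample y− and a preferred sample y+ at a random time t, and regress a learned vector field onto the displacement y+ − y−. The `velocity_net` module and its call signature are assumptions, not the authors' implementation.

```python
import torch

def pfm_training_step(velocity_net, y_neg, y_pos, optimizer):
    """Hedged sketch of one flow-matching step on a batch of preference pairs.

    y_neg, y_pos: (B, D) tensors of rejected / preferred embeddings.
    velocity_net: network mapping (y_t, t) -> predicted velocity (assumed signature).
    """
    batch = y_pos.size(0)
    t = torch.rand(batch, 1)                 # random time in [0, 1]
    y_t = (1.0 - t) * y_neg + t * y_pos      # straight-line interpolant between the pair
    target = y_pos - y_neg                   # constant velocity along that path
    pred = velocity_net(y_t, t)
    loss = torch.mean((pred - target) ** 2)  # flow-matching regression loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```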