Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Iterative Foundation Model Fine-Tuning on Multiple Rewards
Authors: Pouya M. Ghari, Simone Sciabola, Ye Wang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across diverse domains including text, biological sequence, and small molecule generation, demonstrate the effectiveness of the proposed algorithm compared to state-of-the-art baselines. |
| Researcher Affiliation | Industry | Pouya M. Ghari (Biogen); Simone Sciabola (Biogen); Ye Wang (Biogen). Corresponding Author: EMAIL |
| Pseudocode | Yes | Algorithm 1 Iterative RS: Iterative Multi-Objective Model Fine-Tuning 1: Input: Reference policy $\pi_{\mathrm{ref}}$, learning rate $\eta$, merge frequency $m$. 2: Initialize $\pi_{\theta_{i,1}}$, $i \in \{1, \ldots, N\}$, as $\pi_{\mathrm{ref}}$; $S_0$ by sampling $M$ objectives uniformly. 3: for $t = 1, \ldots, T$ do 4: Set $S_t = S_{t-1}$ 5: if $t \bmod m = 0$ then 6: Merge policy weights $\{\theta_{i,t}\}_{i=1}^{N}$ to obtain the shared parameter $\rho_t$ as in equation 9. 7: Sample uniformly at random $M$ objectives to update $S_t$. 8: end if 9: For any objective $i \in S_t$, update the policy parameter $\theta_{i,t}$ as in equation 8. 10: end for 11: Merge all policy weights $\{\theta_{i,T}\}_{i=1}^{N}$ to obtain the shared parameter $\rho_T$. 12: Output: Policy $\pi_{\rho_T}$. |
| Open Source Code | Yes | Codes are available at https://github.com/pouyamghari/Iterative RS. |
| Open Datasets | Yes | The goal of this task is to generate small molecules that exhibit specific desirable energy properties... A GPT-2 model is pre-trained on SMILES representations of 2 million molecules from the MOSES dataset [40], resulting in a model referred to as Mol GPT-2. This pre-trained model is then fine-tuned on the QM9 dataset [6, 45] to optimize for multiple objectives. ... a GPT-2 model referred to as DNAGPT-2 is pre-trained on approximately 700,000 unlabeled DNA sequences, each 200 base pairs long, from the MPRA dataset [18]... we use Llama-3.2-3B-Instruct as the base model. This foundation model is fine-tuned on the Reddit Summary dataset [49] for the post summarization task. |
| Dataset Splits | Yes | The dataset was split into 80% training, 10% validation, and 10% test sets. |
| Hardware Specification | Yes | Model training was conducted using four V100 GPUs. |
| Software Dependencies | No | To fine-tune the Mol GPT-2 model using MORLHF, RS, and Iterative RS, we employed PPO from the TRL library. |
| Experiment Setup | Yes | All models were fine-tuned with a learning rate of $1.41 \times 10^{-5}$ using the Adam optimizer and a batch size of 128. ...the number of optimization steps is set to T = 100. ...For Iterative RS, the merging frequency is set to m = 4. |
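
The control flow of Algorithm 1 quoted in the Pseudocode row can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: equations 8 and 9 are not reproduced on this page, so the per-objective update is left as a caller-supplied function (`update_fn`, hypothetical), and the merge defaults to a plain uniform average of weights (`average_merge`, a stand-in for equation 9). Passing the latest merged parameter to `update_fn` is likewise an assumption about how the update might consume it. Model weights are represented as toy lists of floats.

```python
import random
from typing import Callable, List, Optional, Sequence

Params = List[float]  # toy stand-in for model weights


def average_merge(thetas: Sequence[Params]) -> Params:
    """Uniform average of weights: a placeholder assumption, not equation 9."""
    n = len(thetas)
    return [sum(vals) / n for vals in zip(*thetas)]


def iterative_rs(
    theta_ref: Params,
    num_objectives: int,  # N
    num_active: int,      # M
    num_steps: int,       # T
    merge_every: int,     # m
    update_fn: Callable[[Params, Optional[Params], int], Params],
    merge_fn: Callable[[Sequence[Params]], Params] = average_merge,
    seed: int = 0,
) -> Params:
    rng = random.Random(seed)
    # Step 2: every policy starts from the reference; S_0 is M objectives
    # sampled uniformly at random.
    thetas = [list(theta_ref) for _ in range(num_objectives)]
    active = rng.sample(range(num_objectives), num_active)
    rho: Optional[Params] = None  # no merged parameter before the first merge
    for t in range(1, num_steps + 1):
        # Steps 4-8: keep S_{t-1} unless t is a multiple of m, in which case
        # merge the weights and resample the active objectives.
        if t % merge_every == 0:
            rho = merge_fn(thetas)
            active = rng.sample(range(num_objectives), num_active)
        # Step 9: update only the policies of the active objectives.
        for i in active:
            thetas[i] = update_fn(thetas[i], rho, i)
    # Steps 11-12: final merge gives the output parameter rho_T.
    return merge_fn(thetas)


if __name__ == "__main__":
    # Toy usage: each "objective" pulls its weights toward its own target.
    targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

    def toy_update(theta: Params, rho: Optional[Params], i: int) -> Params:
        return [p + 0.5 * (g - p) for p, g in zip(theta, targets[i])]

    # T = 100 and m = 4 mirror the Experiment Setup row; N, M are illustrative.
    print(iterative_rs([0.0, 0.0], 3, 2, 100, 4, toy_update))
```

In a real setting `update_fn` would run a policy-optimization step (the page mentions PPO from the TRL library) and `merge_fn` would implement the paper's equation 9; both are kept pluggable here because their exact form is not given above.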