Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Aligning Transformers with Continuous Feedback via Energy Rank Alignment

Authors: Shriram Chennakesavalu, Frank Hu, Sebastian Ibarraran, Grant M. Rotskoff

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We deploy this approach to align molecular transformers and protein language models to generate molecules and protein sequences, respectively, with externally specified properties and find that it does so robustly, searching through diverse parts of chemical space. In numerical experiments, we demonstrate that this algorithm successfully aligns molecular transformer model to identify a highly diverse set of chemicals with properties favored by our choice of reward. Finally, we demonstrate that ERA is able to align a protein language model to generate mutated protein sequences with desirable properties according to a computational reward model. We test ERA on both chemical and language tasks to shed light on the following questions: 1) Can we use ERA to robustly fine-tune our model to generate samples according to a desired distribution? 2) What is the effect of changing the inverse-temperature β during ERA? 3) Do we maintain sample diversity (and validity) without regularizing to remain close to a reference policy, and what is the effect of increased regularization? 4) Can we simultaneously target multiple properties with high fidelity, and how can we trade off between desired properties?
Researcher Affiliation	Academia	Shriram Chennakesavalu Department of Chemistry Stanford University Stanford CA, 94305 EMAIL Frank Hu Department of Chemistry Stanford University Stanford CA, 94305 EMAIL Sebastian Ibarraran Department of Chemistry Stanford University Stanford CA, 94305 EMAIL Grant M. Rotskoff Department of Chemistry Stanford University Stanford CA, 94305 EMAIL
Pseudocode	Yes	B ERA implementation Implementing energy rank alignment is straightforward to implement within existing code bases. We provide sample PyTorch code for the ERA loss function below. import torch.nn as nn from torch.nn.functional import logsigmoid def era_loss(pi_logps_1, pi_logps_2, ref_logps_1, ref_logps_2, energies_1, energies_2, beta, gamma): ... return era_loss.mean()
Open Source Code	No	We will release the code as open source upon submission of the paper.
Open Datasets	Yes	Starting from a random initialization, we carry out pretraining on a dataset of 2.4M small molecules from the ChEMBL database Zdrazil et al. [2024] for 180 epochs. For the work here, we used a computational oracle that predicts the docking score for two kinases, JNK3 and GSK3β, where these oracles were defined using the tdc package. We consider mutating the TrpB protein at 4 sites (positions 182, 183, 184, and 186) to all of the 20 standard amino acids and compute the EVMutation score Hopf et al. [2017] for all 20^4 = 160000 sequences (see Yang et al. [2023] for dataset).
Dataset Splits	No	No explicit train/test/validation splits with percentages or counts are provided for the primary alignment training of ERA. The paper describes how synthetic datasets were generated or existing datasets were filtered for specific uses (e.g., 'generate a dataset $D = \{(y_1^{(i)}, y_2^{(i)}, U(y_1^{(i)}), U(y_2^{(i)}))\}_{i=1}^{N}$', 'fine-tuning step on all molecules in ChEMBL with an oracle score above 0.5 (7386 molecules for JNK3 and 43381 for GSK3β)', 'randomly sampled 512 mutated sequences'). While these describe the data used, they do not specify how this data was further divided into training, validation, and test sets for the alignment process itself.
Hardware Specification	Yes	For all chemical alignment experiments, we trained on an in-house cluster with 8 Nvidia 4080 GPUs. For ESM3 experiments, we used resources of the National Energy Research Scientific Computing Center (NERSC), a Department of Energy Office of Science User Facility. Jobs run on NERSC used at most 4 Nvidia A100 GPUs (either 40GB or 80GB depending on what was allocated).
Software Dependencies	No	The paper mentions software like PyTorch, RDKit, and tdc, but does not specify version numbers for any of them. For example, 'We provide sample PyTorch code for the ERA loss function below.' and 'All of the properties can be easily computed using either the RDKit package or the tdc Huang et al. [2021] package.'
Experiment Setup	Yes	We use a decoder-only representation for the molecular generator Bagal et al. [2022], where the generator has 2 layers, an embedding dimension of 512, a vocabulary of 324 tokens, and totals 3.5M parameters. For sampling from our molecular generator, we use top-k sampling with k = 5 and a sampling temperature of T = 1 in all experiments for consistency. For pretraining, we used an Adam optimizer with a learning rate of 1.0 x 10^-5. All alignment properties were initialized with the weights of the pretrained model and trained using an Adam optimizer with learning rate 1.0 x 10^-6. For the experiments here, we used the RMSProp optimizer with a learning rate of 1.0 x 10^-5.
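
The sampling settings quoted under Experiment Setup (top-k sampling with k = 5 at temperature T = 1) can be illustrated with a minimal sketch. This is not the authors' code: the function name `top_k_sample`, its arguments, and the toy logits are assumptions made for illustration, and the sketch uses only the Python standard library rather than PyTorch.

```python
import math
import random


def top_k_sample(logits, k=5, temperature=1.0, rng=None):
    """Sample one token index from the k highest-scoring logits.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if rng is None:
        rng = random.Random()

    # Keep the indices of the k largest logits
    # (all of them if k exceeds the vocabulary size).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Temperature-scale, then softmax over the kept logits only.
    # Subtracting the maximum keeps exp() numerically stable.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]


# Toy usage: made-up logits over a 324-token vocabulary (the size quoted above).
toy_rng = random.Random(0)
toy_logits = [toy_rng.gauss(0.0, 1.0) for _ in range(324)]
token_id = top_k_sample(toy_logits, k=5, temperature=1.0, rng=toy_rng)
```

With T = 1 the logits are left unscaled, so the only effect of these settings is to restrict each sampling step to the five most likely tokens and renormalize their probabilities.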