Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

Authors: Ling Yang, Xinchen Zhang, Ye Tian, Shiyi Zhang, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models.
Researcher Affiliation	Academia	Ling Yang¹, Xinchen Zhang², Ye Tian³, Shiyi Zhang², Chenming Shang², Minghao Xu³, Wentao Zhang³, Bin Cui³. ¹Princeton University, ²Tsinghua University, ³Peking University
Pseudocode	Yes	Algorithm 1: The pseudocode of HermesFlow
Open Source Code	Yes	https://github.com/Gen-Verse/HermesFlow
Open Datasets	Yes	We randomly select 5,000 image-caption pairs from JourneyDB [33] as our homologous input data. For the Visual Question Answering (VQA) data corresponding to each pair, we combine the VQA from JourneyDB with the VQA generated from TIFA [14] for a comprehensive evaluation. Our HermesFlow is trained upon Show-o [48], using a batch size of 4 for both caption and generation data over 3,000 steps. We employ the AdamW optimizer with a weight decay of 0.01, ... Evaluation Metrics: To assess multimodal understanding capabilities, we evaluate using POPE [20], MME [5], Flickr30k [25], VQAv2 [8], GQA [15], and MMMU [58]. For visual generation capabilities, we use GenEval [7] and DPG-Bench [13] to evaluate the model's prompt-image alignment. We further assess image fidelity with FID [12] and CLIP-Score [27]. Additionally, we conduct a comprehensive user study to objectively compare our model with other baselines.
Dataset Splits	No	The paper states: "We randomly select 5,000 image-caption pairs from JourneyDB [33] as our homologous input data." It describes how this data is used for curating preference data and that evaluation is done on external benchmarks (POPE, MME, etc.). However, it does not specify how these 5,000 JourneyDB pairs themselves are split into training, validation, or test sets for the HermesFlow model's training or iterative optimization process.
Hardware Specification	Yes	All experiments are conducted under 8*NVIDIA A100 GPUs.
Software Dependencies	No	The paper mentions the "AdamW optimizer" but does not specify version numbers for any software libraries (e.g., PyTorch, TensorFlow, Python) or other dependencies.
Experiment Setup	Yes	Our HermesFlow is trained upon Show-o [48], using a batch size of 4 for both caption and generation data over 3,000 steps. We employ the AdamW optimizer with a weight decay of 0.01, and an initial learning rate of 2e-5 with a cosine scheduling. The parameter β for Pair-DPO is set to 0.2.
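
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration, and the cosine learning-rate schedule can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' training code: the names `CONFIG` and `cosine_lr` are hypothetical, and the absence of warmup and a final learning rate of 0 are assumptions, since the quoted text specifies neither.

```python
import math

# Values quoted in the Experiment Setup row; the key names are hypothetical.
CONFIG = {
    "batch_size": 4,        # for both caption and generation data
    "total_steps": 3000,
    "weight_decay": 0.01,   # AdamW weight decay
    "initial_lr": 2e-5,
    "pair_dpo_beta": 0.2,   # β for Pair-DPO
}


def cosine_lr(step, total_steps=CONFIG["total_steps"],
              initial_lr=CONFIG["initial_lr"], min_lr=0.0):
    """Cosine decay from initial_lr to min_lr.

    Assumptions (not stated in the source): no warmup phase, min_lr = 0.
    """
    # Clamp the step so the schedule stays flat outside [0, total_steps].
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the learning rate at a few points along the schedule.
    for step in (0, 750, 1500, 2250, 3000):
        print(step, cosine_lr(step))
```

Under these assumptions the learning rate starts at 2e-5, passes through roughly half that value at step 1,500, and reaches 0 at step 3,000.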