Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

Authors: Ling Yang, Xinchen Zhang, Ye Tian, Shiyi Zhang, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models.
Researcher Affiliation | Academia | Ling Yang (1), Xinchen Zhang (2), Ye Tian (3), Shiyi Zhang (2), Chenming Shang (2), Minghao Xu (3), Wentao Zhang (3), Bin Cui (3); (1) Princeton University, (2) Tsinghua University, (3) Peking University
Pseudocode | Yes | Algorithm 1: The pseudocode of HermesFlow
Open Source Code | Yes | https://github.com/Gen-Verse/HermesFlow
Open Datasets | Yes | We randomly select 5,000 image-caption pairs from JourneyDB [33] as our homologous input data. For the Visual Question Answering (VQA) data corresponding to each pair, we combine the VQA from JourneyDB with the VQA generated from TIFA [14] for a comprehensive evaluation. Our HermesFlow is trained upon Show-o [48], using a batch size of 4 for both caption and generation data over 3,000 steps. We employ the AdamW optimizer with a weight decay of 0.01, [...] Evaluation Metrics: To assess multimodal understanding capabilities, we evaluate using POPE [20], MME [5], Flickr30k [25], VQAv2 [8], GQA [15], and MMMU [58]. For visual generation capabilities, we use GenEval [7] and DPG-Bench [13] to evaluate the model's prompt-image alignment. We further assess image fidelity with FID [12] and CLIP-Score [27]. Additionally, we conduct a comprehensive user study to objectively compare our model with other baselines.
Dataset Splits | No | The paper states: "We randomly select 5,000 image-caption pairs from JourneyDB [33] as our homologous input data." It describes how this data is used to curate preference data and notes that evaluation is done on external benchmarks (POPE, MME, etc.). However, it does not specify how these 5,000 JourneyDB pairs are split into training, validation, or test sets for HermesFlow's training or iterative optimization process.
Hardware Specification | Yes | All experiments are conducted under 8*NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions the "AdamW optimizer" but does not specify version numbers for any software libraries (e.g., PyTorch, TensorFlow, Python) or other dependencies.
Experiment Setup | Yes | Our HermesFlow is trained upon Show-o [48], using a batch size of 4 for both caption and generation data over 3,000 steps. We employ the AdamW optimizer with a weight decay of 0.01, and an initial learning rate of 2e-5 with a cosine scheduling. The parameter β for Pair-DPO is set to 0.2.
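
To make the reported experiment setup concrete, the following is a minimal Python sketch, not the authors' implementation. It collects the quoted hyperparameters, computes a cosine learning-rate schedule, and evaluates a standard DPO loss for a single preference pair to show where β enters. Several points are assumptions rather than facts from the paper: the schedule has no warmup and decays to zero, the loss is the generic DPO objective (the exact Pair-DPO objective is defined in the paper and may differ), and the names `ReportedSetup`, `cosine_lr`, and `dpo_loss` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters quoted in the Experiment Setup row."""
    batch_size: int = 4
    total_steps: int = 3000
    weight_decay: float = 0.01
    initial_lr: float = 2e-5
    beta: float = 0.2


def cosine_lr(step: int, cfg: ReportedSetup, min_lr: float = 0.0) -> float:
    """Cosine decay from cfg.initial_lr to min_lr.

    Assumption: no warmup and min_lr = 0; the paper only says "cosine scheduling".
    """
    clamped = min(max(step, 0), cfg.total_steps)
    progress = clamped / cfg.total_steps
    return min_lr + 0.5 * (cfg.initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float) -> float:
    """Standard DPO loss for one preference pair.

    Assumption: this is the generic DPO objective, not the paper's Pair-DPO.
    """
    # How much more the policy prefers the chosen sample than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    # -log(sigmoid(beta * margin)) equals softplus(-beta * margin);
    # computed in a numerically stable form.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


cfg = ReportedSetup()
lr_start = cosine_lr(0, cfg)
lr_mid = cosine_lr(cfg.total_steps // 2, cfg)
lr_end = cosine_lr(cfg.total_steps, cfg)
loss_no_preference = dpo_loss(-10.0, -10.0, -10.0, -10.0, cfg.beta)
loss_good_margin = dpo_loss(-8.0, -12.0, -10.0, -10.0, cfg.beta)
```

Under these assumptions the learning rate starts at 2e-5, reaches half that value at step 1,500, and ends at zero at step 3,000. A larger β scales the preference margin more aggressively, so the same log-probability gap yields a lower loss.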