Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ComfyMind: Toward General-Purpose Generation via Tree-Based Planning and Reactive Feedback

Authors: Litao Guo, Xinli Xu, Luozhou Wang, Jiantao Lin, Jinsong Zhou, Zixin Zhang, Bolan Su, Yingcong Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate ComfyMind on three public benchmarks: ComfyBench, GenEval, and Reason-Edit, which span generation, editing, and reasoning tasks. Results show that ComfyMind consistently outperforms existing open-source baselines and achieves performance comparable to GPT-Image-1.
Researcher Affiliation	Collaboration	Litao Guo, HKUST(GZ); Xinli Xu, HKUST(GZ); Luozhou Wang, HKUST(GZ); Jiantao Lin, HKUST(GZ); Jinsong Zhou, HKUST(GZ); Zixin Zhang, HKUST(GZ); Bolan Su, Bytedance; Ying-Cong Chen, HKUST(GZ) and HKUST
Pseudocode	No	The paper describes methods and processes in text and provides system prompts as figures, but it does not contain explicit pseudocode blocks or algorithm sections.
Open Source Code	Yes	Project page: https://github.com/EnVision-Research/ComfyMind
Open Datasets	Yes	We evaluate ComfyMind on three public benchmarks: ComfyBench, GenEval, and Reason-Edit, which span generation, editing, and reasoning tasks. Results show that ComfyMind consistently outperforms existing open-source baselines and achieves performance comparable to GPT-Image-1.
Dataset Splits	No	The paper mentions using the ComfyBench, GenEval, Reason-Edit, and WISE benchmarks, but it does not provide specific details about training, validation, or test splits for them. For example, it states that ComfyBench has '200 graded difficulty generative and editing tasks' but does not specify how these tasks are split.
Hardware Specification	Yes	The server is equipped with an NVIDIA RTX A6000 GPU with 48GB of VRAM, providing sufficient computational capacity for generation tasks.
Software Dependencies	No	The paper mentions the use of the 'ComfyUI platform' and 'LLMs'/'VLMs' but does not specify version numbers for these or other software dependencies used in the implementation of ComfyMind itself.
Experiment Setup	No	The paper describes the system architecture, components, and evaluation methodology (benchmarks used) but does not provide specific hyperparameters, optimizer settings, or other system-level training configurations for the ComfyMind agent or its internal models.