Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration

Authors: Yuyao Zhang, Jinghao Li, Yu-Wing Tai

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments demonstrate that LayerCraft excels in various creative workflows, from narrative scene composition to iterative and batch image editing, empowering both experts and non-experts to produce controllable, high-quality images with minimal effort. Section 4 describes datasets, APIs, evaluation metrics, and hardware setup.
Researcher Affiliation	Academia	Yuyao Zhang (Dartmouth College); Jinghao Li (CUHK); Yu-Wing Tai (Dartmouth College)
Pseudocode	No	The paper describes the methodology in text and provides an architectural diagram (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks. The operations are described narratively.
Open Source Code	Yes	Code will be released at https://github.com/PeterYYZhang/LayerCraft. Our code, instructions and dataset will be released alongside the final submission in the supplemental.
Open Datasets	Yes	To ensure diversity, we use ChatGPT (via O1) to generate a list of 500 unique objects across various categories...After filtering mismatched pairs using LLM-based validation, we obtain a final dataset of 300,000 high-quality pairs, which we name Image-guided Inpainting Assets (IPA300K). The dataset will be released on Hugging Face. We evaluate our LayerCraft framework against two categories of state-of-the-art approaches: multi-agent systems (upper part) and generic models (lower part) on T2I-CompBench [15]...First, we evaluate compositional generalization on the GenEval benchmark [13].
Dataset Splits	Yes	OIN is trained for 20,000 iterations on a 50K subset of IPA300K, while OminiControl is fine-tuned for 50,000 iterations. Additional samples are drawn from the remaining dataset for qualitative evaluation. Due to computational constraints, we employed a stratified sampling strategy and evaluated the models on 20% of the test data, ensuring balanced representation across object types and scene configurations.
Hardware Specification	Yes	The Object Integration Network (OIN) is built using Diffusers and PEFT, and trained with a batch size of 1 and gradient accumulation over 4 steps on 4 NVIDIA A6000 Ada GPUs (48GB each).
Software Dependencies	No	We use OpenAI's GPT-4o [1] as the base LLM for both the LayerCraft Coordinator and ChainArchitect agent...Our text-to-image backbone is FLUX.1-dev [18], implemented via the Hugging Face Diffusers library [43]. The Object Integration Network (OIN) is built using Diffusers and PEFT.
Experiment Setup	Yes	We use OpenAI's GPT-4o [1] as the base LLM for both the LayerCraft Coordinator and ChainArchitect agent, with the temperature set to 0.1 to balance control and creativity. Our text-to-image backbone is FLUX.1-dev [18]...The Object Integration Network (OIN) is built using Diffusers and PEFT, and trained with a batch size of 1 and gradient accumulation over 4 steps...We use a LoRA rank of 4 and enable gradient checkpointing for memory efficiency. OIN is trained for 20,000 iterations on a 50K subset of IPA300K, while OminiControl is fine-tuned for 50,000 iterations.
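
The training figures quoted in the Hardware Specification and Experiment Setup rows can be collected into one record to make the implied arithmetic explicit. The sketch below is illustrative only: the class name `OINTrainingSetup` and its field names are hypothetical, and two assumptions that the quoted text does not confirm are marked in comments (data-parallel training with one replica per GPU, and one "iteration" meaning one optimizer step).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OINTrainingSetup:
    """Hypothetical container for the OIN values quoted in the table above."""

    per_device_batch_size: int = 1        # "batch size of 1"
    gradient_accumulation_steps: int = 4  # "gradient accumulation over 4 steps"
    num_gpus: int = 4                     # "4 NVIDIA A6000 Ada GPUs (48GB each)"
    lora_rank: int = 4                    # "LoRA rank of 4"
    gradient_checkpointing: bool = True   # enabled "for memory efficiency"
    train_iterations: int = 20_000        # "trained for 20,000 iterations"
    train_subset_size: int = 50_000       # "50K subset of IPA300K"

    def effective_batch_size(self) -> int:
        # ASSUMPTION: data-parallel training with one replica per GPU.
        # The quoted text does not say how the 4 GPUs are used.
        return (
            self.per_device_batch_size
            * self.gradient_accumulation_steps
            * self.num_gpus
        )

    def approx_passes_over_subset(self) -> float:
        # ASSUMPTION: one "iteration" is one optimizer step, so each
        # iteration consumes effective_batch_size() samples.
        samples_seen = self.train_iterations * self.effective_batch_size()
        return samples_seen / self.train_subset_size


setup = OINTrainingSetup()
print("effective batch size:", setup.effective_batch_size())
print("approx. passes over the 50K subset:", setup.approx_passes_over_subset())
```

Under those two assumptions the effective batch size is 16 samples per optimizer step. If an iteration instead means a single forward/backward micro-step, or if only one GPU holds a replica, the figures shrink accordingly, which is why the paper's own code release would be needed to settle them.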