Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Rare Text Semantics Were Always There in Your Diffusion Transformer

Authors: Seil Kang, Woojung Han, Dayun Ju, Seong Jae Hwang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	5 Experiments. Benchmarks. We evaluate our method primarily on Text-to-Image generation, while demonstrating its versatility in Text-to-Video generation and Text-driven Image Editing. For image generation, we use RareBench [5] for rare concept evaluation, and T2I-CompBench [40] and GenEval [41] for compositional tasks with common concepts. Table 1: RareBench performance comparison across various models and categories in Image and Video generation. Figure 10: Ablation studies on RareBench.
Researcher Affiliation	Academia	Yonsei University
Pseudocode	No	The paper describes methods using mathematical formulas and textual explanations (e.g., in Sections 3 and 4) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	Yes, we will release our codes and instructions once the blind review period is finished.
Open Datasets	Yes	We evaluate our method primarily on Text-to-Image generation... For image generation, we use RareBench [5] for rare concept evaluation, and T2I-CompBench [40] and GenEval [41] for compositional tasks with common concepts.
Dataset Splits	Yes	In total, our analysis considers 100 text prompts. 40 prompts come directly from RareBench's eight benchmark categories (five per category) [5], and the remaining 60 prompts are generated with GPT-4o [42], comprising 30 rare prompts, created strictly under RareBench's [5] rarity guidelines, and 30 common prompts added as a bias check to ensure that our analysis is not limited to rare cases.
Hardware Specification	Yes	Our computational resources included a single NVIDIA 48GB A6000 GPU for general experiments and results, with a single NVIDIA 80GB A100 GPU dedicated to Text-to-Video generation.
Software Dependencies	No	The paper mentions using Stable Diffusion 3.0 [4] and FLUX.1 [3] and reproducing results from R2F [5], but it does not specify version numbers for these or any other software libraries or dependencies (e.g., PyTorch, CUDA, Python versions) that would be needed for replication.
Experiment Setup	Yes	All hyperparameters are maintained at their default settings except for a single scaling factor (σ = 1.3). Further implementation details are provided in the D.1.