Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment

Authors: Dongping Chen, Ruoxi Chen, Shu Pu, Zhaoyi Liu, Yanru Wu, Caixi Chen, Benlin Liu, Yue Huang, Yao Wan, Pan Zhou, Ranjay Krishna

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Using ISG-BENCH, we demonstrate that recent unified vision-language models perform poorly on generating interleaved content. While compositional approaches that combine separate language and image models show a 111% improvement over unified models at the holistic level, their performance remains suboptimal at both block and image levels. To facilitate future work, we develop ISG-AGENT, a baseline agent employing a plan-execute-refine pipeline to invoke tools, achieving a 122% performance improvement. 4 EXPERIMENTS AND ANALYSIS We first validate ISG against human annotations (Section 4.1), demonstrating its alignment with human judgments. Our subsequent evaluation of interleaved generation (Section 4.2) reveals the limitations of unified models and moderate success of compositional approaches, underscoring current challenges in instruction-following for interleaved generation.
Researcher Affiliation Academia ¹University of Washington, ²Huazhong University of Science and Technology, ³University of Illinois Urbana-Champaign, ⁴University of Notre Dame. Equal Contribution, Corresponding Author
Pseudocode Yes The pseudo-algorithm of ISG is shown in Algorithm 1. We provide the prompts used by the MLLM to build ISG and to judge models' responses.
Algorithm 1 ISG Evaluation
1: procedure EVALUATE(P, G)  ▷ P: Prompt, G: Generated Answer
2:   S_Pred ← LLM(P)  ▷ Predict Structure
3:   if Structure_Match(S_Pred, G) then
4:     return Evaluate_With_Whole(P, G)
5:   end if
6:   Questions ← Generate_QA(P)  ▷ Construct Block-wise (T-T, T-I, I-I) Evaluation
7:   score ← 0, total ← 0
8:   for all r = (sub, obj, Q) ∈ Questions do
9:     judgement ← VQA_Module(Q, sub, obj)
10:    if judgement = Yes then
11:      score ← score + 1
12:    end if
13:    total ← total + 1
14:  end for
15:  Final_Score ← score/total
16:  return Final_Score
17: end procedure
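The control flow of Algorithm 1 can be sketched in Python. This is a minimal sketch, not the authors' implementation: all callables (`predict_structure`, `structure_match`, `evaluate_whole`, `generate_qa`, `vqa_judge`) are hypothetical stand-ins for the paper's LLM and VQA modules.

```python
def evaluate(prompt, generated, predict_structure, structure_match,
             evaluate_whole, generate_qa, vqa_judge):
    """Sketch of Algorithm 1 (ISG Evaluation).

    All callables are hypothetical stand-ins for the paper's
    LLM/VQA components; only the control flow follows the pseudocode.
    """
    s_pred = predict_structure(prompt)            # predict interleaved structure
    if structure_match(s_pred, generated):
        return evaluate_whole(prompt, generated)  # whole-answer evaluation path
    # Block-wise (T-T, T-I, I-I) evaluation via generated QA triples
    questions = generate_qa(prompt)
    score = total = 0
    for sub, obj, q in questions:
        if vqa_judge(q, sub, obj) == "Yes":
            score += 1
        total += 1
    return score / total if total else 0.0
```

The final score is simply the fraction of block-wise QA judgements answered "Yes", matching lines 7–16 of the pseudocode.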
Open Source Code No The paper includes a project website (https://interleave-eval.github.io) but does not explicitly state that the source code for ISG or ISG-AGENT is provided there or via a specific repository link. The text mentions using various third-party tools but not their own implementation code for the described methodology.
Open Datasets Yes In conjunction with ISG, we introduce a benchmark, ISG-BENCH, encompassing 1,150 samples across 8 categories and 21 subcategories. This benchmark dataset includes complex language-vision dependencies and golden answers to evaluate models effectively on vision-centric tasks such as style transfer, a challenging area for current models. Based on ISG, we develop the first benchmark, termed ISG-BENCH, for interleaved text-and-image generation to assess multimodal understanding and generation capabilities across various tasks. As shown in Table 2, ISG-BENCH consists of a categorically balanced dataset of 1,150 samples, covering 21 subtasks across 8 daily interleaved generative scenarios. We provide all MLLM prompts used, and all image-text content was safety-reviewed to ensure benchmark security, quality, and transparency.
Dataset Splits No The paper introduces ISG-BENCH as a benchmark dataset for evaluating models, consisting of 1,150 samples. It does not provide explicit training, testing, or validation splits for this dataset, as it is primarily intended for evaluation rather than training. While it mentions how samples were collected and categorized, it doesn't describe data partitioning for reproducing experiments that would involve training models on ISG-BENCH.
Hardware Specification Yes Table 9: Average computing time for one sample in ISG-BENCH on A800 servers.
Software Dependencies No The paper mentions using specific models like 'GPT-4o', 'Claude-3.5-Sonnet', 'Gemini-1.5-pro-latest', 'Stable Diffusion 2.1', 'Flux.1-dev', 'SD3', 'Instructpix2pix', 'Ultra Edit', 'DynamiCrafter', 'SV3D', and 'Dream Mover', along with their hyperparameters. However, it does not explicitly list general ancillary software dependencies like programming language versions (e.g., Python 3.x) or common library versions (e.g., PyTorch 1.x, CUDA x.x).
Experiment Setup Yes D.3 MODEL SETTINGS Open-source Unified Models. We employed four open-source unified models, namely Show-o, MiniGPT-5, Anole, CoMM-MiniGPT5 (MiniGPT-5 finetuned on CoMM), and SEED-LLaMA-14B. All hyper-parameters are detailed as follows: Show-o (Xie et al., 2024): guidance scale: 1.75, generation timesteps: 18, temperature: 0.7, resolution: 256×256. MiniGPT-5 (Zheng et al., 2023): image size: 224, temperature: 0.7, repetition penalty: 1.2, guidance scale: 7.5. D.2 ISG-AGENT DETAILS Image Generation Tool: We use Stable Diffusion 2.1 or Flux.1-dev to generate images based on textual prompts. In the system, the tool agent automatically provides refined and concise prompts extracted from the step's prompt for better generation performance. Input size: 512×512 pixels. Inference steps: 28.
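The hyperparameters quoted above can be gathered into a single configuration sketch. The values are taken from the excerpt; the dictionary layout and key names are assumptions for illustration only.

```python
# Hyperparameters quoted in the excerpt (Appendix D.2/D.3 of the paper).
# The grouping and key names are assumptions, not the authors' config format.
MODEL_SETTINGS = {
    "show-o": {
        "guidance_scale": 1.75,
        "generation_timesteps": 18,
        "temperature": 0.7,
        "resolution": (256, 256),
    },
    "mini-gpt5": {
        "image_size": 224,
        "temperature": 0.7,
        "repetition_penalty": 1.2,
        "guidance_scale": 7.5,
    },
    # ISG-AGENT image generation tool: Stable Diffusion 2.1 or Flux.1-dev
    "isg-agent_image_tool": {
        "input_size": (512, 512),
        "inference_steps": 28,
    },
}
```

Keeping settings in one place like this makes it straightforward to check that a reproduction run matches the reported setup.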