Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Consistent Story Generation: Unlocking the Potential of Zigzag Sampling
Authors: Mingxiao Li, Mang Ning, Marie-Francine Moens
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results, based on both quantitative metrics and qualitative evaluations, demonstrate that our method significantly outperforms previous approaches in generating coherent and consistent visual stories. |
| Researcher Affiliation | Academia | Mingxiao Li (KU Leuven), Mang Ning (Utrecht University), Marie-Francine Moens (KU Leuven) |
| Pseudocode | Yes | We describe our asymmetric setup for zigzag sampling below, and summarize the proposed method in Algorithms 1 and 2 in the appendix. |
| Open Source Code | Yes | The code is available at https://github.com/Mingxiao-Li/Asymmetry-Zigzag-StoryDiffusion. |
| Open Datasets | Yes | We conducted a user study to evaluate our method in comparison with four existing approaches: IP-Adapter [24], Consistory Model [28], StoryDiffusion [27], and 1Prompt1Story [29]. All models were used to generate images based on prompts from the ConsiStory+ benchmark, using the same random seeds as reported in their respective papers to ensure a fair comparison. |
| Dataset Splits | Yes | We randomly selected 30 prompts from the benchmark dataset and generated corresponding image sequences using all competing methods. Twenty participants were invited for the user study. For each participant, a custom program randomly selected 20 out of the 30 prompts, and presented four resulting image sequences obtained with different methods for each selected prompt. |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA A100 GPU. For the FLUX model, which differs architecturally from SDXL by separating text-image and image-image interaction stages, we adopt a different strategy. Experiments for FLUX are also run on a single NVIDIA A100 GPU. |
| Software Dependencies | Yes | For SDXL, we use the stabilityai/stable-diffusion-xl-base-1.0 version, and for FLUX, we adopt the black-forest-labs/FLUX.1-dev version. |
| Experiment Setup | Yes | For the implementation of our method on the SDXL model, we cache visual tokens only from the mid and upper layers across all steps. Accordingly, feature injection during the zig step is also limited to these layers. We use a classifier-free guidance scale of 5.5 for both the zig and generation steps, and set it to 0 during the zag step. We evaluate the impact of varying k values (ranging from 0.2 to 0.8) on generation performance (Table 5). The results show that increasing k generally improves image similarity but slightly reduces text alignment. Notably, a value of k = 0.2 achieves the best balance between these two objectives. |
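
The settings quoted in the table can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the class name `ZigzagGuidance`, the function `assign_prompts`, the phase labels, and the `seed` parameter are all hypothetical. Only the model identifiers, the guidance scales (5.5 for zig and generation, 0 for zag), and the study sizes (30 prompts, 20 participants, 20 prompts each) come from the rows above. The sketch also assumes that each participant's subset is drawn independently, which the quoted text suggests but does not state outright.

```python
import random
from dataclasses import dataclass

# Model identifiers quoted in the Software Dependencies row.
SDXL_MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"
FLUX_MODEL_ID = "black-forest-labs/FLUX.1-dev"

# Phase labels are illustrative names chosen for this sketch.
PHASES = ("zig", "zag", "generation")


@dataclass(frozen=True)
class ZigzagGuidance:
    """Classifier-free guidance scale per phase, as reported for SDXL."""

    zig: float = 5.5
    zag: float = 0.0
    generation: float = 5.5

    def scale_for(self, phase: str) -> float:
        """Return the guidance scale for one of the three phases."""
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase!r}")
        return getattr(self, phase)


def assign_prompts(num_prompts=30, per_participant=20,
                   num_participants=20, seed=0):
    """Draw a random subset of prompt indices for each participant.

    Assumption: subsets are sampled independently and without replacement;
    the seed exists only to make this sketch repeatable.
    """
    rng = random.Random(seed)
    return [
        sorted(rng.sample(range(num_prompts), per_participant))
        for _ in range(num_participants)
    ]


if __name__ == "__main__":
    guidance = ZigzagGuidance()
    for phase in PHASES:
        print(phase, guidance.scale_for(phase))
    print(len(assign_prompts()), "participants assigned")
```

Keeping the three guidance scales in one frozen object makes it harder to apply the zig scale during the zag step by accident. The values of k swept in Table 5 are left out because the quoted text gives only the range (0.2 to 0.8), not the individual settings.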