Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis

Authors: Hengyuan Cao, Yutong Feng, Biao Gong, Yijing Tian, Yunhong Lu, Chuang Liu, Bin Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across diverse image generation tasks, such as subject-driven and spatially conditioned generation, show that repurposed video models outperform those trained directly on images. These results highlight the untapped potential of large-scale video generators for broader visual applications. DRA-Ctrl provides new insights into reusing resource-intensive video models and lays foundation for future unified generative models across visual modalities.
Researcher Affiliation | Collaboration | Hengyuan Cao (Zhejiang University); Yutong Feng (Kunbyte AI); Biao Gong (Ant Group); Yijing Tian (Hangzhou Normal University); Yunhong Lu (Zhejiang University); Chuang Liu (Hangzhou Normal University); Bin Wang (Kunbyte AI)
Pseudocode | No | The paper describes the proposed method, DRA-Ctrl, and its components (mixup-based transition strategy, Frame Skip Position Embedding, attention masking strategy) in Section 3 and its subsections, using descriptive text and mathematical formulas (e.g., Equation 1, Equation 3). However, there is no clearly labeled 'Pseudocode' or 'Algorithm' block or figure.
Open Source Code | Yes | The project page is https://dra-ctrl-2025.github.io/DRA-Ctrl/.
Open Datasets | Yes | For spatially-aligned image generation, we adopt a subset of the Text-to-Image-2M dataset [64] for training, consisting of around 160K samples... For spatially-aligned generation, we employ the COCO2017 validation dataset [26] comprising 5,000 images resized to 512×512 resolution as the test set... For subject-driven image generation, we utilize the high-quality subset of the Subjects200K dataset [43], comprising approximately 110K image pairs for training... For subject-driven generation, we evaluate our method on DreamBench [40] by generating images for 25 text prompts per subject...
Dataset Splits | Yes | For spatially-aligned image generation, we adopt a subset of the Text-to-Image-2M dataset [64] for training, consisting of around 160K samples... For spatially-aligned generation, we employ the COCO2017 validation dataset [26] comprising 5,000 images resized to 512×512 resolution as the test set... For subject-driven image generation, we utilize the high-quality subset of the Subjects200K dataset [43], comprising approximately 110K image pairs for training.
Hardware Specification | Yes | We employ the AdamW optimizer and conduct training on 2 NVIDIA H800 GPUs (80GB memory each)... This model is trained using 4 NVIDIA H800 GPUs.
Software Dependencies | No | The paper mentions various models and optimizers used (e.g., AdamW, QWen2.5-VL, CLIP-Large, LoRA, Depth Anything) but does not provide specific version numbers for software dependencies like programming languages (e.g., Python) or deep learning frameworks (e.g., PyTorch).
Experiment Setup | Yes | The models are trained with a batch size of 8 and gradient accumulation over 2 steps, resulting in an effective batch size of 16. We employ the AdamW optimizer... For spatially-aligned image generation, we train the model for 6,000 steps... For subject-driven image generation, we train the model for 9,000 steps... DRA-Ctrl employs LoRA [14] to fine-tune the base model with a rank of 16... Additionally, we set δ to 12 in the Frame Skip Position Embedding (FSPE).
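
The Experiment Setup values above can be gathered into a small configuration sketch that also checks the effective batch size arithmetic (8 × 2 = 16). This is an illustrative summary only, not the authors' code: the class name, field names, and the `samples_seen` helper are hypothetical, and the helper assumes that each reported "step" is one optimizer update over the full effective batch, which the quoted text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Hyperparameters quoted in the Experiment Setup row.

    All field names are hypothetical; only the values come from the report.
    """
    batch_size: int = 8          # "batch size of 8"
    grad_accum_steps: int = 2    # "gradient accumulation over 2 steps"
    optimizer: str = "AdamW"
    lora_rank: int = 16          # "rank of 16"
    fspe_delta: int = 12         # delta in Frame Skip Position Embedding
    steps_spatial: int = 6_000   # spatially-aligned image generation
    steps_subject: int = 9_000   # subject-driven image generation

    @property
    def effective_batch_size(self) -> int:
        # Gradients from grad_accum_steps micro-batches are summed
        # before a single optimizer update.
        return self.batch_size * self.grad_accum_steps

    def samples_seen(self, steps: int) -> int:
        # ASSUMPTION: one "step" equals one optimizer update.
        return steps * self.effective_batch_size


setup = TrainingSetup()
print(setup.effective_batch_size)
print(setup.samples_seen(setup.steps_spatial))
print(setup.samples_seen(setup.steps_subject))
```

Under that assumption, the sample counts printed here can be compared against the reported training set sizes (around 160K and approximately 110K) to get a rough sense of how many passes over the data each run makes. Learning rate and software versions are not quoted in the table, so they are left out of the sketch.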