Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Exploring Diffusion Transformer Designs via Grafting

Authors: Keshigeyan Chandrasegaran, Michael Poli, Dan Fu, Dongjun Kim, Lea M. Hadzic, Manling Li, Agrim Gupta, Stefano Massaroli, Azalia Mirhoseini, Juan Carlos Niebles, Stefano Ermon, Fei-Fei Li

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our grafting approach in increasingly challenging generative modeling setups: Result I: Grafting yields hybrid architecture designs with good quality for class-conditional image generation (Sec. 4.2). ... Result II: We construct efficient hybrid architectures for high-resolution text-to-image (T2I) generation via grafting (Sec. 5).
Researcher Affiliation | Collaboration | Keshigeyan Chandrasegaran (1,2), Michael Poli (1,2), Daniel Y. Fu (3,4), Dongjun Kim (1), Lea M. Hadzic (1), Manling Li (1,5), Agrim Gupta (6), Stefano Massaroli (2,7), Azalia Mirhoseini (1), Juan Carlos Niebles (1,8), Stefano Ermon (1), Li Fei-Fei (1). Affiliations: 1 Stanford University; 2 Liquid AI; 3 Together AI; 4 UC San Diego; 5 Northwestern University; 6 Google DeepMind; 7 RIKEN; 8 Salesforce Research
Pseudocode | No | The paper describes procedures in text and uses flowcharts, but does not contain a formal pseudocode or algorithm block. Figure 1(b) shows a high-level overview of the grafting procedure, but it is not pseudocode.
Open Source Code | Yes | Code and grafted models: grafting.stanford.edu.
Open Datasets | Yes | For class-conditional image generation, we use ImageNet-1K [15].
Dataset Splits | Yes | In practice, we find that competitive performance can be recovered using only 10% of the training data, even when replacing all MHA or MLP layers in DiT-XL/2. ... For all experiments in Table 3, we use 10% of the ImageNet-1K training data and train for 50K steps. ... Given the architectural restructuring, finetuning was performed using 25% of the training data.
Hardware Specification | Yes | each experiment completes in under 24 hours on 8 H100 GPUs
Software Dependencies | No | The paper mentions common deep learning frameworks and models like PyTorch and DiT but does not specify version numbers for any software dependencies.
Experiment Setup | Yes | Stage 1: Operator initialization. For each new operator, we perform activation distillation using 8K ImageNet-1K samples. Each operator is trained for 200 epochs with a batch size of 64 and an initial learning rate of 1e-4. ... Stage 2: Lightweight finetuning. For all experiments in Table 3, we use 10% of the ImageNet-1K training data and train for 50K steps. We use a batch size of 256, linearly warming up the learning rate to 1e-4 over 1000 steps. Experiments typically complete in under 10 hours on 8 H100 GPUs.
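
The hyperparameters quoted in the Experiment Setup row imply a rough training budget that can be worked out by hand. The following is a minimal sketch of that arithmetic, not the authors' code. Several values are assumptions not stated on this page: that "8K" means exactly 8,000 samples, that the last partial batch (if any) is kept, that the ImageNet-1K training set has the standard 1,281,167 images, that warmup starts from a learning rate of 0, and that the rate is held constant after warmup. All function and constant names are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
STAGE1_SAMPLES = 8_000        # "8K ImageNet-1K samples" (assumed to mean exactly 8,000)
STAGE1_EPOCHS = 200
STAGE1_BATCH = 64

STAGE2_STEPS = 50_000
STAGE2_BATCH = 256
STAGE2_DATA_FRACTION = 0.10   # "10% of the ImageNet-1K training data"
PEAK_LR = 1e-4
WARMUP_STEPS = 1_000

# Assumption: standard ILSVRC-2012 training-set size; not stated on this page.
IMAGENET1K_TRAIN_SIZE = 1_281_167


def stage1_optimizer_steps() -> int:
    """Optimizer steps per operator in Stage 1 (partial last batch kept)."""
    steps_per_epoch = math.ceil(STAGE1_SAMPLES / STAGE1_BATCH)
    return steps_per_epoch * STAGE1_EPOCHS


def stage2_effective_epochs() -> float:
    """Approximate number of passes over the 10% subset during Stage 2."""
    subset_size = STAGE2_DATA_FRACTION * IMAGENET1K_TRAIN_SIZE
    return STAGE2_STEPS * STAGE2_BATCH / subset_size


def warmup_lr(step: int) -> float:
    """Linear warmup from 0 to PEAK_LR over WARMUP_STEPS.

    Holding the rate constant afterwards is an assumption; the quoted
    text only describes the warmup phase.
    """
    if step >= WARMUP_STEPS:
        return PEAK_LR
    return PEAK_LR * step / WARMUP_STEPS


if __name__ == "__main__":
    print("Stage 1 optimizer steps per operator:", stage1_optimizer_steps())
    print("Stage 2 passes over the 10% subset:", round(stage2_effective_epochs(), 1))
    print("Learning rate at step 500:", warmup_lr(500))
```

Under these assumptions, Stage 1 amounts to 25,000 optimizer steps per operator, and Stage 2 processes 12.8 million samples, which is roughly 100 passes over the 10% subset.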