Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Generative Pre-trained Autoregressive Diffusion Transformer

Authors: Yuan Zhang, Jiacheng Jiang, Guoqing Ma, Zhiying Lu, Bo Wang, Haoyang Huang, Jianlong Yuan, Nan Duan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments in three scenarios: video generation, video representation, and few-shot learning. The results demonstrate that GPDiT exhibits excellent generative and representational capabilities, which are crucial for building a unified model for visual understanding and generation, as well as the ability to transfer to downstream tasks with minimal cost and no need for additional modules.
Researcher Affiliation | Collaboration | Yuan Zhang1, Jiacheng Jiang2, Guoqing Ma3, Zhiying Lu4, Haoyang Huang3, Jianlong Yuan3; 1Peking University, 2Tsinghua University, 3StepFun, China, 4University of Science and Technology of China
Pseudocode | No | The paper describes the methodology in prose and mathematical formulations in Section 4 "Generative Pre-trained Autoregressive Diffusion Transformer (GPDiT)" and its subsections, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We will release code with instructions to reproduce the results.
Open Datasets | Yes | Datasets. For video generation task, UCF-101 [39] dataset consists of 13,320 videos across 101 action categories and is widely used for human action recognition, MSR-VTT [49] is a large-scale dataset designed for open-domain video captioning, containing 10,000 video clips from 20 categories, with each clip annotated with 20 English sentences by Amazon Mechanical Turk workers. We assess the capability of GPDiT in video representation on the UCF-101 dataset. ... First, we perform a 200k-iteration warm-up using an unconditioned image dataset from LAION-Aesthetic [35] with a learning rate of 1e-4 and batch size of 960.
Dataset Splits | No | For video generation, we randomly sample 10,000 videos from UCF-101 and 7,000 videos from MSR-VTT. The Fréchet Video Distance (FVD) [42] is computed for entire videos, while the average Fréchet Inception Distance (FID) [12] and Inception Score (IS) [34] are calculated over individual frames. For the video representation task, top-1 accuracy is reported using a linear probing protocol. In the few-shot learning setting, we provide per-task video results along with qualitative analyses. For each task, we create a SFT dataset with 20 video sequences, each generated by sampling three pairs from a set of 40 task-specific image pairs.
Hardware Specification | Yes | The Adam optimizer with a learning rate of 1e-4 and a total batch size of 96 across 32 H100 GPUs is used.
Software Dependencies | No | The Adam optimizer with a learning rate of 1e-4 and a total batch size of 96 across 32 H100 GPUs is used. Training lasts for 400k iterations.
Experiment Setup | Yes | Implementation details. To ensure fair comparison, we design a benchmark model with 80 million parameters based on the architecture in Table 1. Trained on UCF-101, each video is center-cropped and resized to 256×256. The Adam optimizer with a learning rate of 1e-4 and a total batch size of 96 across 32 H100 GPUs is used. Training lasts for 400k iterations. We further scale the model to a two-billion-parameter variant, GPDiT-H (see Table 1). First, we perform a 200k-iteration warm-up using an unconditioned image dataset from LAION-Aesthetic [35] with a learning rate of 1e-4 and batch size of 960. Training continues for another 200k iterations on a mixed image-video dataset, with equal sampling of images and videos, and batch sizes of 256 and 64, respectively. Video frames are sampled every three frames and clipped into 17-frame segments. Each image is center-cropped to the resolution closest to the original, with target sizes of 256×256, 192×320, or 320×192, and video to 192×320. Finally, we continue training the GPDiT-H model on a pure video dataset featuring variable video lengths ranging from 17 to 45 frames. This stage lasts for an additional 150k iterations, using a reduced learning rate of 2e-5. The resulting model is denoted as GPDiT-H-LONG. To compress video latents, we employ Wan VAE [43], which reduces four frames into a single latent representation. ... During inference, we apply classifier-free guidance with a scale of 1.2 for the GPDiT-H model and 2.0 for the GPDiT-B model.
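
The staged schedule quoted under Experiment Setup can be easier to follow when written out as data. The following is a minimal sketch, not the authors' configuration: the `Stage` class, its field names, and the helper functions are hypothetical, and only the numeric values are taken from the quoted text. Fields the quote does not state are left as `None` or "not stated" rather than guessed. The per-GPU batch figure assumes the total batch is split evenly across GPUs, which the paper excerpt does not confirm.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    """One training stage. Field names are illustrative, not from the paper."""
    name: str
    data: str
    iterations: int
    learning_rate: Optional[float]  # None where the quoted text gives no value
    batch_size: str                 # kept as text because stage 2 uses two sizes


# Values copied from the quoted setup for the two-billion-parameter GPDiT-H model.
GPDIT_H_STAGES: List[Stage] = [
    Stage("image warm-up", "LAION-Aesthetic, unconditioned images",
          200_000, 1e-4, "960"),
    Stage("mixed image-video", "images and videos, sampled equally",
          200_000, None, "256 images / 64 videos"),
    Stage("long video (yields GPDiT-H-LONG)", "videos of 17 to 45 frames",
          150_000, 2e-5, "not stated"),
]


def total_iterations(stages: List[Stage]) -> int:
    """Sum the iteration counts over all stages."""
    return sum(stage.iterations for stage in stages)


def per_gpu_batch(total_batch: int, num_gpus: int) -> int:
    """Per-GPU batch size, ASSUMING an even data-parallel split."""
    if total_batch % num_gpus != 0:
        raise ValueError("total batch is not evenly divisible by GPU count")
    return total_batch // num_gpus


if __name__ == "__main__":
    for stage in GPDIT_H_STAGES:
        print(f"{stage.name}: {stage.iterations} iterations, "
              f"lr={stage.learning_rate}, batch={stage.batch_size}")
    print("GPDiT-H-LONG total iterations:", total_iterations(GPDIT_H_STAGES))
    # Benchmark model: total batch size 96 across 32 H100 GPUs.
    print("Benchmark per-GPU batch:", per_gpu_batch(96, 32))
```

Under these assumptions, the three GPDiT-H stages add up to 550k iterations, and the 80-million-parameter benchmark run works out to 3 samples per GPU per step.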