Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Native-Resolution Image Synthesis

Authors: ZiDong Wang, LEI BAI, Xiangyu Yue, Wanli Ouyang, Yiyuan Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments in class-guided image generation validate NiT as a significant advancement due to its native-resolution modeling. With a single model, NiT firstly attains state-of-the-art (SOTA) results on both 256x256 (2.03 FID, Fréchet inception distance) and 512x512 (1.45 FID) benchmarks in class-guided ImageNet generation [17]. Impressively, NiT highlights its strong zero-shot generalization ability. For instance, as shown in Fig. 2, it achieves an FID of 4.52 on unseen 1024x1024 resolution and an FID of 4.11 on novel 9:16 aspect ratio images (i.e., 416x768), excelling in its flexibility and transferability to unfamiliar resolutions and respective ratios.
Researcher Affiliation | Collaboration | ¹CUHK MMLab ²Shanghai AI Lab. The Chinese University of Hong Kong (CUHK) is an academic institution, and Shanghai AI Lab is a prominent research institution, indicating a collaboration between academic/research entities.
Pseudocode | Yes | Algorithm 1 Packed Full-Attention with FlashAttention for flexible-length sequence processing. Algorithm 2 Packed Adaptive Layer Normalization and NiT block.
Open Source Code | No | Project Page: https://wzdthu.github.io/NiT. While a project page is provided, the paper does not contain an explicit statement about the release of source code for the methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | NiT, trained solely on ImageNet, demonstrates excellent zero-shot generalization performance... We conduct text-to-image generation experiments on the SAM [38] dataset with captions generated by MiniCPM-V [86]. We use a token number in each iteration as 786,416 and train the model for 400K steps and evaluate image quality using COCO-val-2014 [41] benchmark.
Dataset Splits | Yes | We evaluate NiT on standard 256x256 and 512x512 benchmarks... For high-resolution generalization evaluation, experiments are conducted on four resolutions: {768x768, 1024x1024, 1536x1536, 2048x2048}. For aspect ratio generalization analysis, experiments are conducted on six aspect ratios: {1:3, 9:16, 3:4, 4:3, 16:9, 3:1}. The corresponding resolutions are: {320x960, 416x768, 480x640, 640x480, 768x416, 960x320}. We evaluate image quality using COCO-val-2014 [41] benchmark.
Hardware Specification | Yes | We compare the training and inference efficiency on the ImageNet-256 benchmark using a single NVIDIA A100 GPU.
Software Dependencies | No | We use FlashAttention-2 [15] to achieve this efficiently. The paper mentions a specific library but does not provide version numbers for core software components (e.g., Python, PyTorch, CUDA) required for reproducibility.
Experiment Setup | Yes | We use DC-AE [12] with a 32 down-sampling scale and 32 latent dimensions as our image encoder... For class-guided image generation, we use 131,072 tokens in one iteration. Unless otherwise stated, all results in Tabs. 1 to 3 are evaluated with the NiT model trained for 1000K steps (corresponds to 131B token budgets)... All the results are reported with the utilization of classifier-free-guidance (CFG)... CFG scale is set as 1.5. Table 8: Detailed Quantitative Results of NiT-XL. We further provide the CFG scale and interval for each experiment.
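
The token figures quoted in the Experiment Setup row can be sanity-checked with simple arithmetic. The sketch below assumes that every latent position produced by the 32x down-sampling encoder becomes one token (patch size 1), which the excerpt does not state. The helper names `latent_tokens` and `token_budget` are hypothetical.

```python
# Assumption: one token per latent position, i.e. patch size 1 on top of the
# 32x down-sampling encoder. The excerpt does not confirm the patch size.
DOWNSAMPLE = 32


def latent_tokens(width: int, height: int, downsample: int = DOWNSAMPLE) -> int:
    """Number of latent positions for an image whose sides divide evenly."""
    if width % downsample or height % downsample:
        raise ValueError("image sides must be multiples of the down-sampling scale")
    return (width // downsample) * (height // downsample)


def token_budget(tokens_per_iteration: int, steps: int) -> int:
    """Total number of tokens processed over a training run."""
    return tokens_per_iteration * steps


# A few of the resolutions listed in the table above.
resolutions = [(256, 256), (512, 512), (1024, 1024), (416, 768), (320, 960)]
tokens = {res: latent_tokens(*res) for res in resolutions}

# 131,072 tokens per iteration for 1000K steps, as quoted in the setup row.
budget = token_budget(131_072, 1_000_000)

if __name__ == "__main__":
    for (w, h), n in tokens.items():
        print(f"{w}x{h}: {n} tokens")
    print(f"budget: {budget / 1e9:.1f}B tokens")
```

Under this assumption, a 131,072-token iteration would hold 2,048 images at 256x256 or 128 images at 1024x1024, and the product of tokens per iteration and steps agrees with the quoted 131B token budget.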
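
The Experiment Setup row mentions classifier-free guidance with a scale of 1.5 and a per-experiment guidance interval. As a minimal sketch of the standard CFG combination rule (not taken from the paper's code), the snippet below blends conditional and unconditional predictions and falls back to the conditional prediction outside the interval. The function name `cfg_combine`, the list-based vectors, the normalized time `t`, and the interval bounds are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def cfg_combine(
    cond: Sequence[float],
    uncond: Sequence[float],
    scale: float = 1.5,
    t: float = 0.5,
    interval: Tuple[float, float] = (0.0, 1.0),
) -> List[float]:
    """Standard CFG rule: uncond + scale * (cond - uncond).

    Guidance is applied only while `t` lies inside `interval`; outside it,
    the conditional prediction is returned unchanged (assumed convention).
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    low, high = interval
    if not (low <= t <= high):
        return list(cond)
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Inside the interval: predictions are pushed away from the unconditional one.
    print(cfg_combine([1.0, 2.0], [0.0, 0.0]))
    # Outside the interval: guidance is switched off.
    print(cfg_combine([1.0, 2.0], [0.0, 0.0], t=0.9, interval=(0.0, 0.5)))
```

With a scale of 1.0 the rule reduces to the conditional prediction, so the quoted scale of 1.5 corresponds to a fairly mild push toward the class condition.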
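
The Pseudocode row refers to packed full attention over flexible-length sequences. The paper's Algorithm 1 is not reproduced on this page, so the following is only a generic sketch of the bookkeeping that packing usually requires: cumulative offsets that mark where each image's tokens start and end, so that attention can be restricted to tokens of the same image. The names `pack_lengths` and `same_sequence` are hypothetical, and the example token counts assume one token per 32x32 pixel block.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence, Tuple


def pack_lengths(token_counts: Sequence[int]) -> Tuple[List[int], int]:
    """Return cumulative offsets (starting at 0) and the longest sequence length."""
    if not token_counts or any(n <= 0 for n in token_counts):
        raise ValueError("token_counts must be non-empty and positive")
    cu_seqlens = [0, *accumulate(token_counts)]
    return cu_seqlens, max(token_counts)


def same_sequence(cu_seqlens: Sequence[int], i: int, j: int) -> bool:
    """True if packed positions i and j belong to the same image."""
    total = cu_seqlens[-1]
    if not (0 <= i < total and 0 <= j < total):
        raise IndexError("position outside the packed sequence")
    # bisect_right gives the index of the segment that contains the position.
    return bisect_right(cu_seqlens, i) == bisect_right(cu_seqlens, j)


if __name__ == "__main__":
    # Three images packed together: 256x256, 416x768, and 320x960 (assumed counts).
    cu_seqlens, max_len = pack_lengths([64, 312, 300])
    print(cu_seqlens, max_len)
    print(same_sequence(cu_seqlens, 0, 63), same_sequence(cu_seqlens, 63, 64))
```

Variable-length attention kernels typically take offsets of this form in place of a padded batch, which avoids spending compute on padding tokens when images of different sizes share one iteration.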