Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling

Authors: Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on class-conditional ImageNet benchmarks. DiCo outperforms recent diffusion models in both generation quality and speed. Furthermore, the purely convolutional DiCo demonstrates strong potential in text-to-image generation.
Researcher Affiliation | Collaboration | Yuang Ai (1,2), Qihang Fan (1,2), Xuefeng Hu (3), Zhenheng Yang (3), Ran He (1,2), Huaibo Huang (1,2). (1) CASIA, (2) UCAS, (3) ByteDance. Corresponding author: Huaibo Huang <EMAIL>
Pseudocode | Yes | We provide its detailed PyTorch implementation in Algorithm 1. Algorithm 1: PyTorch code of text conditional depthwise convolution:

    import torch
    import torch.nn.functional as F

    def text_conditional_dwconv(x, context):
        # x: (B, C, H, W) input feature maps
        # context: (B, 77, C) CLIP text embeddings after an MLP
        # output: (B, C, H, W) output after depthwise convolution
        B, C, H, W = x.shape
        context_pad = torch.cat([context, context[:, -1:].expand(-1, 4, -1)], dim=1)  # (B, 81, C)
        kernels = context_pad.reshape(B, 9, 9, C).permute(0, 3, 1, 2).reshape(B * C, 1, 9, 9)
        x_flat = x.view(1, B * C, H, W)
        output = F.conv2d(x_flat, kernels, padding=4, groups=B * C).view(B, C, H, W)
        return output

Open Source Code | Yes | Code and models: https://github.com/shallowdream204/DiCo
Open Datasets | Yes | Following previous works [62, 100, 81], we conduct experiments on class-conditional ImageNet-1K [13] generation benchmark at 256×256 and 512×512 resolutions.
Dataset Splits | Yes | Following previous works [62, 100, 81], we conduct experiments on class-conditional ImageNet-1K [13] generation benchmark at 256×256 and 512×512 resolutions.
Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 (80G) GPUs.
Software Dependencies | No | Algorithm 1: PyTorch code of text conditional depthwise convolution: import torch; import torch.nn.functional as F. All these metrics are computed using OpenAI's TensorFlow evaluation toolkit [14].
Experiment Setup | Yes | For DiCo-S/B/L/XL, we adopt exactly the same experimental settings as used for DiT. Specifically, we employ a constant learning rate of 1×10^-4, no weight decay, and a batch size of 256. The only data augmentation applied is random horizontal flipping. We maintain an exponential moving average (EMA) of the DiCo weights during training, with a decay rate of 0.9999. The pre-trained VAE [67] is used to extract latent features. For our largest model, DiCo-H, we follow the training settings of U-ViT [6], increasing the learning rate to 2×10^-4 and scaling the batch size to 1024 to accelerate training. Additional details are provided in Appendix Sec. B.
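
The training settings quoted under Experiment Setup can be written down as a small configuration sketch. This is only an illustration: the names TrainConfig, DICO_S_TO_XL, DICO_H, and ema_update are hypothetical and are not taken from the DiCo repository. The weight decay for DiCo-H is left unset because the quoted text does not state it, and the sketch assumes the EMA decay and horizontal flipping also apply to DiCo-H.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters quoted above."""
    learning_rate: float            # constant learning rate (no schedule)
    batch_size: int
    weight_decay: Optional[float]   # None = not stated in the quoted text
    ema_decay: float = 0.9999       # EMA decay rate of the model weights
    horizontal_flip: bool = True    # only augmentation mentioned


# DiCo-S/B/L/XL: same settings as DiT, per the quoted text.
DICO_S_TO_XL = TrainConfig(learning_rate=1e-4, batch_size=256, weight_decay=0.0)

# DiCo-H: follows U-ViT settings; only lr and batch size are quoted.
DICO_H = TrainConfig(learning_rate=2e-4, batch_size=1024, weight_decay=None)


def ema_update(ema: Dict[str, float], weights: Dict[str, float], decay: float) -> Dict[str, float]:
    """Standard EMA step: ema <- decay * ema + (1 - decay) * weights (in place)."""
    for name, value in weights.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema
```

With a decay of 0.9999, each step moves the averaged weights only 0.01% of the way toward the current weights, so the average reflects roughly the last ten thousand updates.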