AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation

Authors: Anil Kag, Huseyin Coskun, Jierun Chen, Junli Cao, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Jian Ren

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce AsCAN, a hybrid architecture, combining both convolutional and transformer blocks. ... AsCAN supports a variety of tasks: recognition, segmentation, class-conditional image generation, and features a superior trade-off between performance and latency. We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance compared to the most recent public and commercial models.
Researcher Affiliation | Industry | Anil Kag, Huseyin Coskun, Jierun Chen, Junli Cao, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Jian Ren; Snap Inc. Project Page: https://snap-research.github.io/snap_image. Work done during an internship at Snap Inc.
Pseudocode | No | The paper includes architectural diagrams and mathematical equations, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code.
Open Source Code | No | We plan to release our asymmetric architecture implementation for facilitating future research in this direction.
Open Datasets | Yes | For the image classification task, we perform extensive latency analysis on the ImageNet-1K dataset and show our models achieve superior throughput-performance trade-offs than existing works (see Fig. 3).
Dataset Splits | Yes | ADE20K [104] is a popular scene-parsing dataset used to evaluate the semantic segmentation performance. It consists of 20K train and 2K validation images over 150 fine-grained semantic categories.
Hardware Specification | Yes | Fig. 3 plots the inference speed on a V100 GPU with batch size 16 (measured in images processed per second) and top-1 accuracy achieved on this task for various models. In addition, Appendix Tab. 7 shows the parameter count and inference speed on both V100 and A100 GPUs...
Software Dependencies | No | The paper mentions 'torch-compile' and 'benchmark utility from timm library [100]' but does not provide specific version numbers for PyTorch, timm, or other key software dependencies.
Experiment Setup | Yes | For training ImageNet-1K models with 224×224 resolution, we use the AdamW optimizer with a peak learning rate of 3e-3 for 300 epochs. We use a batch size of 4096 images during this training period. We follow a cosine schedule for decaying the learning rate to the minimum learning rate of 5e-6.
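The Research Type row quotes the paper's high-level description of a hybrid design that mixes convolutional and transformer blocks. Because the implementation has not been released yet (see the Open Source Code row), the following is only a minimal, generic sketch of what a convolution-plus-attention stage can look like in PyTorch; the block designs, channel widths, and the specific asymmetric block arrangement are assumptions, not the paper's actual architecture.

```python
# Illustrative sketch only: a generic convolution + attention hybrid stage.
# AsCAN's actual asymmetric block arrangement is not public, so block types,
# widths, and ordering below are assumptions.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Simple depthwise-separable convolutional block with a residual connection (assumed design)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # depthwise
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, dim, kernel_size=1),                         # pointwise
            nn.GELU(),
        )

    def forward(self, x):
        return x + self.net(x)

class AttentionBlock(nn.Module):
    """Plain multi-head self-attention over flattened spatial tokens."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # (B, HW, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return x + out.transpose(1, 2).reshape(b, c, h, w)

class HybridStage(nn.Module):
    """Asymmetric mix: several cheap conv blocks followed by a few attention blocks."""
    def __init__(self, dim, num_conv=2, num_attn=1):
        super().__init__()
        blocks = [ConvBlock(dim) for _ in range(num_conv)]
        blocks += [AttentionBlock(dim) for _ in range(num_attn)]
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):
        return self.blocks(x)
```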
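The Hardware Specification and Software Dependencies rows indicate that inference speed was measured on a V100 GPU at batch size 16 with timm's benchmark utility and torch-compile, without pinned versions. A manual timing loop in the same spirit might look like the sketch below; the warm-up length, iteration count, and example model are assumptions rather than the paper's setup.

```python
# Minimal throughput sketch (images/sec at batch size 16), approximating the
# quoted V100 measurement; warm-up, iteration count, and model are assumptions.
import time
import torch

def measure_throughput(model, batch_size=16, resolution=224, iters=50, warmup=10):
    device = torch.device("cuda")
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, resolution, resolution, device=device)
    with torch.no_grad():
        for _ in range(warmup):          # warm-up to stabilize clocks and caches
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()         # wait for all queued kernels to finish
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed  # images processed per second

# Example with a stand-in model (requires `import torchvision`):
#   print(measure_throughput(torchvision.models.resnet50()))
```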
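The Experiment Setup row fully specifies the classification recipe: AdamW, peak learning rate 3e-3, 300 epochs, batch size 4096, and cosine decay to a minimum learning rate of 5e-6. A minimal PyTorch sketch of that optimizer and schedule follows; the weight decay value and any warm-up are not mentioned in the quote and are assumptions.

```python
# Sketch of the quoted ImageNet-1K recipe: AdamW, peak LR 3e-3, 300 epochs,
# batch size 4096, cosine decay to a minimum LR of 5e-6.
# Weight decay and warm-up are not given in the quote and are assumed here.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer_and_schedule(model, epochs=300, peak_lr=3e-3, min_lr=5e-6):
    optimizer = AdamW(model.parameters(), lr=peak_lr, weight_decay=0.05)  # weight decay assumed
    # One scheduler step per epoch, decaying the LR from peak_lr down to eta_min.
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=min_lr)
    return optimizer, scheduler

# Usage: train each epoch with an effective batch size of 4096 (e.g., via data
# parallelism and/or gradient accumulation), then call scheduler.step().
```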