AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation
Authors: Anil Kag, Huseyin Coskun, Jierun Chen, Junli Cao, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Jian Ren
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce AsCAN, a hybrid architecture combining both convolutional and transformer blocks. ... AsCAN supports a variety of tasks: recognition, segmentation, class-conditional image generation, and features a superior trade-off between performance and latency. We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance compared to the most recent public and commercial models. |
| Researcher Affiliation | Industry | Anil Kag, Huseyin Coskun, Jierun Chen, Junli Cao, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Jian Ren. Snap Inc. Project Page: https://snap-research.github.io/snap_image. Work done during an internship at Snap Inc. |
| Pseudocode | No | The paper includes architectural diagrams and mathematical equations, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code. |
| Open Source Code | No | We plan to release our asymmetric architecture implementation for facilitating future research in this direction. |
| Open Datasets | Yes | For the image classification task, we perform extensive latency analysis on the ImageNet-1K dataset and show our models achieve superior throughput-performance trade-offs than existing works (see Fig. 3). |
| Dataset Splits | Yes | ADE20K [104] is a popular scene-parsing dataset used to evaluate the semantic segmentation performance. It consists of 20K train and 2K validation images over 150 fine-grained semantic categories. |
| Hardware Specification | Yes | Fig. 3 plots the inference speed on a V100 GPU with batch size 16 (measured in images processed per second) and top-1 accuracy achieved on this task for various models. In addition, Appendix Tab. 7 shows the parameter count and inference speed on both V100 and A100 GPUs... (A throughput-measurement sketch follows the table.) |
| Software Dependencies | No | The paper mentions 'torch-compile' and 'benchmark utility from timm library [100]' but does not provide specific version numbers for PyTorch, timm, or other key software dependencies. |
| Experiment Setup | Yes | For training ImageNet-1K models with 224×224 resolution, we use the AdamW optimizer with a peak learning rate of 3e-3 for 300 epochs. We use a batch size of 4096 images during this training period. We follow a cosine schedule for decaying the learning rate to the minimum learning rate of 5e-6. (A sketch of this schedule follows the table.) |
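
The hardware rows report throughput in images per second at batch size 16 on a V100. Below is a minimal sketch of how such a measurement is commonly done in PyTorch. It is an assumption, not the authors' script: the paper references the timm benchmark utility, whereas this uses a plain CUDA timing loop, and it substitutes a generic timm model for AsCAN since the AsCAN implementation had not been released.

```python
import time
import torch
import timm

# Hypothetical stand-in model; the AsCAN code was unreleased at the time of this report.
model = timm.create_model("resnet50", pretrained=False).cuda().eval()

# Batch size 16 at 224x224 resolution, matching the setting quoted for Fig. 3.
batch = torch.randn(16, 3, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(10):           # warm-up iterations to stabilize clocks and caches
        model(batch)
    torch.cuda.synchronize()      # make sure warm-up kernels have finished
    start = time.perf_counter()
    iters = 50
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()      # wait for all timed kernels to complete
    elapsed = time.perf_counter() - start

print(f"throughput: {iters * batch.shape[0] / elapsed:.1f} images/sec")
```

The explicit `torch.cuda.synchronize()` calls matter here: CUDA kernel launches are asynchronous, so timing without synchronization would measure launch overhead rather than actual inference time.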
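
The "Experiment Setup" row pins down the optimizer and schedule. The following is a minimal sketch of that configuration in PyTorch, assuming a hypothetical stand-in module for the unreleased AsCAN model; details the quote does not cover (warm-up, weight decay, augmentation) are deliberately omitted.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS = 300        # training length quoted in the paper
PEAK_LR = 3e-3      # peak learning rate
MIN_LR = 5e-6       # floor of the cosine decay schedule
BATCH_SIZE = 4096   # global batch size

# Stand-in module; the AsCAN implementation is planned for release but not yet public.
model = torch.nn.Linear(3 * 224 * 224, 1000)

optimizer = AdamW(model.parameters(), lr=PEAK_LR)
# Cosine decay from PEAK_LR down to MIN_LR over the full run, stepped once per epoch.
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=MIN_LR)

for epoch in range(EPOCHS):
    # ... one pass over ImageNet-1K at 224x224 (forward, loss, backward) goes here ...
    optimizer.step()   # placeholder so the scheduler observes an optimizer step
    scheduler.step()   # apply the cosine learning-rate decay for this epoch
```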