CoAtNet: Marrying Convolution and Attention for All Data Sizes
Authors: Zihang Dai, Hanxiao Liu, Quoc V. Le, Mingxing Tan
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints across various datasets" |
| Researcher Affiliation | Industry | Google Research, Brain Team {zihangd,hanxiaol,qvl,tanmingxing}@google.com |
| Pseudocode | No | The paper describes methods and architectures but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | "we utilize three datasets of increasingly larger sizes, namely ImageNet-1K (1.28M images), ImageNet-21K (12.7M images) and JFT (300M images)." |
| Dataset Splits | No | The paper uses the ImageNet-1K and JFT datasets and discusses training and evaluation, but it does not explicitly state training/validation/test split percentages or sample counts, nor does it cite predefined splits, so no concrete split information is available for reproducibility. |
| Hardware Specification | Yes | "On our accelerator of choice (TPU), such operation turns out to be extremely slow [34]" and "TPUv3-core-days denotes the pretraining time" |
| Software Dependencies | No | The paper discusses various models and techniques but does not provide specific software versions (e.g., deep learning framework versions, Python versions, or library versions) used for implementation. |
| Experiment Setup | Yes | "For all Conv and MBConv blocks, we always use the kernel size 3. For all Transformer blocks, we set the size of each attention head to 32, following [22]. The expansion rate for the inverted bottleneck is always 4 and the expansion (shrink) rate for the SE is always 0.25." and "we first pre-train our models on each of the three datasets at resolution 224 for 300, 90 and 14 epochs respectively. Then, we finetune the pre-trained models on ImageNet-1K at the desired resolutions for 30 epochs" (see the configuration sketch below the table) |
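
To make the quoted block hyperparameters concrete, below is a minimal sketch, assuming a PyTorch implementation, of an MBConv block with squeeze-and-excitation using the stated values (3x3 kernel, inverted-bottleneck expansion 4, SE shrink rate 0.25). The layer ordering, GELU activation, BatchNorm placement, and stride-1 residual form are assumptions for illustration, not the authors' released code.

```python
# Sketch only: hyperparameters follow the quotes above; everything else is an assumption.
import torch
import torch.nn as nn


class SqueezeExcite(nn.Module):
    """Squeeze-and-excitation with the paper's stated shrink rate of 0.25."""

    def __init__(self, channels: int, shrink_ratio: float = 0.25):
        super().__init__()
        hidden = max(1, int(channels * shrink_ratio))
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)                 # global average pool
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))  # channel gates
        return x * s                                          # channel-wise re-weighting


class MBConv(nn.Module):
    """Inverted bottleneck: 1x1 expand -> 3x3 depthwise -> SE -> 1x1 project (+ residual)."""

    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion  # expansion rate 4, as quoted
        self.expand = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden), nn.GELU())
        self.depthwise = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),             # 3x3 kernel, as quoted
            nn.BatchNorm2d(hidden), nn.GELU())
        self.se = SqueezeExcite(hidden, shrink_ratio=0.25)
        self.project = nn.Sequential(
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        return x + self.project(self.se(self.depthwise(self.expand(x))))


if __name__ == "__main__":
    # Usage sketch: one stride-1 block on a 56x56 feature map with 64 channels.
    block = MBConv(channels=64)
    out = block(torch.randn(1, 64, 56, 56))
    print(out.shape)  # torch.Size([1, 64, 56, 56])
    # For the Transformer blocks, the paper fixes each attention head to size 32,
    # which corresponds to num_heads = channels // 32 in a standard multi-head layer.
```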