Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
Authors: Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage when processing high-resolution inputs. In dense prediction tasks, it outperforms window-based models, maintaining comparable speeds. These results highlight VRWKV's potential as a more efficient alternative for visual perception tasks. We comprehensively evaluate the substitutability of our VRWKV for ViT in performance, scalability, flexibility, and efficiency. The model's effectiveness is validated in image classification, object detection, and semantic segmentation tasks. |
| Researcher Affiliation | Collaboration | 1The Chinese University of Hong Kong, 2Shanghai AI Laboratory, 3Fudan University, 4Nanjing University, 5Tsinghua University, 6SenseTime Research |
| Pseudocode | Yes | The paper tabulates the four recurrent states with their recurrence relations and initial values: $a_t = e^{-w} a_{t-1} + e^{k_t} v_t$, $a_{-1} = 0$; $b_t = e^{w}\,(b_{t-1} - e^{k_{t+1}} v_{t+1})$, $b_{-1} = \sum_{i=1}^{T-1} e^{-(i-1)w + k_i} v_i$; $c_t = e^{-w} c_{t-1} + e^{k_t}$, $c_{-1} = 0$; $d_t = e^{w}\,(d_{t-1} - e^{k_{t+1}})$, $d_{-1} = \sum_{i=1}^{T-1} e^{-(i-1)w + k_i}$ |
| Open Source Code | Yes | Code and models are available at https://github.com/OpenGVLab/Vision-RWKV. |
| Open Datasets | Yes | These models are trained using large-scale datasets such as ImageNet-1K (Deng et al., 2009) and ImageNet-22K (Deng et al., 2009). In addition, on COCO (Lin et al., 2014), a challenging downstream benchmark, our best model VRWKV-L achieves 50.6% box mAP, 1.9 points better than ViT-L (50.6 vs. 48.7). All models are trained for 160k iterations on the training set of the ADE20K dataset (Zhou et al., 2017). |
| Dataset Splits | Yes | These models are trained using large-scale datasets such as ImageNet-1K (Deng et al., 2009) and ImageNet-22K (Deng et al., 2009). Results. In Table 3, we report the detection results on the COCO val (Lin et al., 2014) dataset using VRWKV and ViT as backbones. All models are trained for 160k iterations on the training set of the ADE20K dataset (Zhou et al., 2017). |
| Hardware Specification | Yes | The results were tested on an Nvidia A100 GPU, as shown in Figure 1. |
| Software Dependencies | No | The paper mentions optimizers like AdamW and frameworks like PyTorch, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Following the training strategy and data augmentation of DeiT (Touvron et al., 2021a), we use a batch size of 1024, AdamW (Loshchilov & Hutter, 2017) with a base learning rate of 5e-4, weight decay of 0.05, and a cosine annealing schedule (Loshchilov & Hutter, 2016). Images are cropped to a resolution of 224×224 for training and validation. |
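The pseudocode row above lists the four recurrent states (a, b, c, d) used to compute the bidirectional WKV attention. The following is a minimal NumPy sketch of those recurrences, not the paper's CUDA kernel: `bi_wkv` is a hypothetical name, `k`/`v` are per-token key and value scalars, `w` is the decay, and `u` the current-token bonus.

```python
import numpy as np

def bi_wkv(k, v, w, u):
    """Naive bidirectional WKV via the four recurrent states (a, b, c, d).

    k, v: length-T arrays of key/value scalars; w, u: scalar decay and bonus.
    Illustrative sketch only; the reference implementation uses a fused kernel.
    """
    T = len(k)
    # Initial values: a_{-1} = c_{-1} = 0; b_{-1}, d_{-1} sum over future tokens.
    a, c = 0.0, 0.0
    b = sum(np.exp(-(i - 1) * w + k[i]) * v[i] for i in range(1, T))
    d = sum(np.exp(-(i - 1) * w + k[i]) for i in range(1, T))
    out = np.empty(T)
    for t in range(T):
        # wkv_t mixes past states (a, c), future states (b, d),
        # and the current token weighted by the bonus u.
        num = a + b + np.exp(u + k[t]) * v[t]
        den = c + d + np.exp(u + k[t])
        out[t] = num / den
        # Recurrences: decay the past states toward token t,
        # and remove token t+1 from the future states.
        a = np.exp(-w) * a + np.exp(k[t]) * v[t]
        c = np.exp(-w) * c + np.exp(k[t])
        if t + 1 < T:
            b = np.exp(w) * (b - np.exp(k[t + 1]) * v[t + 1])
            d = np.exp(w) * (d - np.exp(k[t + 1]))
    return out
```

Each step is O(1) per token, giving the linear complexity in sequence length that the paper contrasts with the quadratic cost of ViT self-attention.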