Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos
Authors: Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, Hongsheng Li
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate that PAM delivers robust performance across a diverse range of regional understanding tasks for both images and videos, while operating 1.2–2.4× faster and consuming less GPU memory compared to prior models. |
| Researcher Affiliation | Academia | ¹CUHK, ²HKU, ³PolyU, ⁴Peking University |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. The methodology is described in prose and through architectural diagrams. |
| Open Source Code | Yes | Code, model and data are available at: https://Perceive-Anything.github.io |
| Open Datasets | Yes | For regional recognition, we utilize multiple instance detection and segmentation datasets [55, 35, 40, 23, 50, 66], along with scene text recognition datasets [56, 31, 30, 19, 24, 14, 76, 57, 4]. We collected and analyzed several existing video datasets, including referring detection and segmentation datasets [71, 47, 18, 62, 58, 17, 85, 64], as well as the recent Sa2VA [79] annotations for the SAV [53] dataset. |
| Dataset Splits | Yes | Recognition performance is assessed on the validation sets of the LVIS (object-level) [23] and PACO (part-level) [50] datasets, alongside the test sets of COCO-Text [61] and Total-Text [14]. For evaluation, we primarily utilize the validation set of the ActivityNet dataset [7]. |
| Hardware Specification | Yes | All training is conducted on 8 NVIDIA A100 GPUs with 80GB. Comparison of GPU memory usage and inference efficiency on an A6000 GPU. |
| Software Dependencies | Yes | We employ Qwen2.5-1.5B/3B [72] as our semantic decoder, and utilize the pre-trained hierarchical SAM 2-Large as the base vision foundation model. |
| Experiment Setup | Yes | The hyper-parameters for each training stage are summarized in Appendix A. Table 8 details the configurations for each training stage of the Perceive Anything Model (PAM). It outlines the vision parameters, dataset characteristics, model specifications, and training hyperparameters throughout the curriculum learning stages. The maximum number of visual tokens varies by input modality: single images are represented using 1024 tokens, while for videos, we sample up to 16 frames, leading to a maximum of 4864 visual tokens. A global batch size of 1024 is used for stages 1 and 1.5, and 256 for stage 2. |
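
The setup values quoted in the table can be collected into a small configuration sketch for anyone attempting a reproduction. This is only an illustration: the names `StageConfig`, `STAGES`, `MAX_VISUAL_TOKENS`, and `per_gpu_samples` are hypothetical and do not come from the paper or its code release. The per-GPU figure assumes plain data parallelism across all 8 training GPUs, which the quoted text does not confirm. Remaining hyperparameters are in the paper's Appendix A (Table 8) and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for one curriculum stage."""
    name: str
    global_batch_size: int


# Global batch sizes as quoted in the Experiment Setup row.
STAGES = {
    "1": StageConfig("1", 1024),
    "1.5": StageConfig("1.5", 1024),
    "2": StageConfig("2", 256),
}

# Visual token limits as quoted in the Experiment Setup row.
MAX_VISUAL_TOKENS = {"image": 1024, "video": 4864}
MAX_VIDEO_FRAMES = 16

# Hardware as quoted in the Hardware Specification row (8x A100 80GB).
NUM_TRAIN_GPUS = 8


def per_gpu_samples(stage: str, num_gpus: int = NUM_TRAIN_GPUS) -> int:
    """Samples each GPU contributes per optimizer step.

    Assumption: the global batch is split evenly across GPUs. How those
    samples are divided into micro-batches or gradient-accumulation
    steps is not stated in the quoted text.
    """
    batch = STAGES[stage].global_batch_size
    if batch % num_gpus != 0:
        raise ValueError("global batch size is not divisible by GPU count")
    return batch // num_gpus


if __name__ == "__main__":
    for key in STAGES:
        print(f"stage {key}: {per_gpu_samples(key)} samples per GPU per step")
```

Under these assumptions, stages 1 and 1.5 require 128 samples per GPU per optimizer step and stage 2 requires 32, so a reproduction on fewer or smaller GPUs would likely need gradient accumulation to match the reported global batch sizes.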