Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Authors: Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Bangalath, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Seungwhan Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman H Khan, Philipp Kraehenbuehl, Piotr Dollar, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM Video Bench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about what, where, when, and how of a video. We evaluate PLM on a total of 20 image benchmarks. We perform an ablation study to assess the importance of each of our proposed data, both synthetic and human-annotated.
Researcher Affiliation | Collaboration | Jang Hyun Cho1,2; Andrea Madotto1; Effrosyni Mavroudi1; Triantafyllos Afouras1; Tushar Nagarajan1; Muhammad Maaz3; Yale Song1; Tengyu Ma1; Shuming Hu1; Suyog Jain1; Miguel Martin1; Huiyu Wang1; Hanoona Rasheed3; Peize Sun1; Po-Yao Huang1; Daniel Bolya1; Nikhila Ravi1; Shashank Jain4; Tammy Stark4; Shane Moon4; Babak Damavandi4; Vivian Lee1; Andrew Westbury1; Salman Khan3; Philipp Krähenbühl2; Piotr Dollár1; Lorenzo Torresani1; Kristen Grauman1,2; Christoph Feichtenhofer1. Affiliations: 1 Meta, 2 UT Austin, 3 MBZUAI, 4 Meta Reality Labs.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We make our work fully reproducible by providing data, training recipes, code and models at https://github.com/facebookresearch/perception_models.
Open Datasets | Yes | To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM Video Bench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about what, where, when, and how of a video. We make our work fully reproducible by providing data, training recipes, code and models at https://github.com/facebookresearch/perception_models. Table 10: PLM training datamix. Our mix includes synthetic and manually annotated data across a combination of image data (QA, captioning, OCR, visual grounding), video data (captioning, grounded captioning, dense captioning, temporal localization) and text-only data. Importantly, all data is publicly accessible, and not generated by proprietary models.
Dataset Splits | Yes | Table 2: Summary of the data mix for training PLM. See Table 10 for the full data blend. We created a balanced validation and test split by the combination of tube categories and number of captions per tube while making sure no video overlaps with the training set. (Appendix G.2) The test split refers to the FGQA module of PLM Video Bench. (Table 21)
Hardware Specification | Yes | The full training process for the PLM-3B model, including all three stages, takes approximately 5.5 days on 128 H100 GPUs. The PLM-1B model requires less training time, while the PLM-8B model takes slightly longer. The detailed breakdown is listed in Table 7. Table 7: Training efficiency of PLM models across scales. PLM-1B (PE L/14): 8 GPUs & 3 hours; 128 GPUs & 1.5 days; 128 GPUs & 2.0 days. PLM-3B (PE L/14): 8 GPUs & 4 hours; 128 GPUs & 3.0 days; 128 GPUs & 2.5 days. PLM-8B (PE G/14): 8 GPUs & 6 hours; 256 GPUs & 3.0 days; 256 GPUs & 2.5 days.
Software Dependencies | No | For all three stages, we use AdamW optimizer [128] with weight decay of 0.05 and use FSDP [129] with FlashAttention2 [130] for overall implementation based on PyTorch [131].
Experiment Setup | Yes | For all three stages, we use AdamW optimizer [128] with weight decay of 0.05 and use FSDP [129] with FlashAttention2 [130] for overall implementation based on PyTorch [131]. Stage 1 training. In stage 1, we use a subset of SA-1B [14] paired with detailed captions generated by our data engine (§3.1). We use a total of 1M samples to train PLM with next token prediction loss, with vision encoder and LLM parameters frozen. This stage is commonly known as the warm-up stage. We use learning rate 1 × 10^-4 for all model scales with global batch size of 512 and 448 × 448 resolution. Stage 2 training. Next, we train on a total of 72.5M samples. Of these, 66M consist of images and videos with synthetically generated annotations produced by our data engine. The remaining 6.5M samples are a subset of human-annotated images and videos from open-source datasets, which are included in our final datamix described in §A.2. We train with global batch size of 2048, learning rate of 4 × 10^-5, weight decay of 0.05 for the full set of parameters (vision encoder, projector, and LLM).
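
As a rough illustration of the setup quoted above, the following Python sketch records the stage 1 and stage 2 hyperparameters and the PLM-3B row of Table 7, then derives optimizer step counts and total compute. The class, field, and function names are hypothetical and are not taken from the perception_models repository. The step counts assume a single pass over the stated number of samples, and the mapping of the three Table 7 entries to stages 1 to 3 in order is also an assumption, since the column headers are not reproduced here.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for the per-stage values quoted in the setup row."""
    num_samples: int
    global_batch_size: int
    learning_rate: float
    weight_decay: float
    frozen_modules: tuple  # modules whose parameters are not updated

    def optimizer_steps(self) -> int:
        # Assumption: one pass over num_samples, last partial batch kept.
        return ceil(self.num_samples / self.global_batch_size)


# Stage 1 (warm-up): 1M samples, vision encoder and LLM frozen.
STAGE1 = StageConfig(1_000_000, 512, 1e-4, 0.05, ("vision_encoder", "llm"))
# Stage 2: 72.5M samples, all parameters trained.
STAGE2 = StageConfig(72_500_000, 2048, 4e-5, 0.05, ())

# PLM-3B row of Table 7 as (num_gpus, days) pairs.
# Assumption: the three entries correspond to stages 1, 2, 3 in order.
PLM_3B_RUNS = [(8, 4 / 24), (128, 3.0), (128, 2.5)]


def wall_clock_days(runs):
    """Total elapsed days if the runs execute one after another."""
    return sum(days for _, days in runs)


def gpu_days(runs):
    """Total compute in GPU-days (GPUs multiplied by days, summed over runs)."""
    return sum(gpus * days for gpus, days in runs)


if __name__ == "__main__":
    print("stage 1 steps:", STAGE1.optimizer_steps())
    print("stage 2 steps:", STAGE2.optimizer_steps())
    print("PLM-3B wall-clock days:", round(wall_clock_days(PLM_3B_RUNS), 2))
    print("PLM-3B GPU-days:", round(gpu_days(PLM_3B_RUNS), 1))
```

Under these assumptions, stage 1 comes to 1,954 optimizer steps and stage 2 to 35,401, and the PLM-3B row sums to about 5.7 wall-clock days (roughly 705 GPU-days), in line with the approximately 5.5 days quoted above.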