Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Panoptic Captioning: An Equivalence Bridge for Image and Text
Authors: Kun-Yu Lin, Hongjun Wang, Weining Ren, Kai Han
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning. To address this, we propose an effective data engine named Pancap Engine to produce high-quality data and a novel method named Pancap Chain to improve panoptic captioning. Specifically, our Pancap Engine first detects diverse categories of entities in images by an elaborate detection suite, and then generates required panoptic captions using entity-aware prompts. Additionally, our Pancap Chain explicitly decouples the challenging panoptic captioning task into multiple stages and generates panoptic captions step by step. More importantly, we contribute a comprehensive metric named Pancap Score and a human-curated test set for reliable model evaluation. Experiments show that our Pancap Chain-13B model can beat state-of-the-art open-source MLLMs like InternVL-2.5-78B and even surpass proprietary models like GPT-4o and Gemini-2.0-Pro, demonstrating the effectiveness of our data engine and method. |
| Researcher Affiliation | Academia | Kun-Yu Lin, Hongjun Wang, Weining Ren, Kai Han; Visual AI Lab, The University of Hong Kong |
| Pseudocode | No | The paper describes the multi-stage Pancap Chain method with components like Entity Instance Localization, Semantic Tag Assignment, Extra Instance Discovery, and Panoptic Caption Generation, detailing the input/output for each stage. However, it presents these as descriptive paragraphs and structured textual explanations rather than formal, code-like pseudocode blocks or algorithms labeled as such. |
| Open Source Code | No | We will release the code and data upon acceptance of the paper. ... Readers can access our code and data via the link to our project page. Project page: https://visual-ai.github.io/pancap/ |
| Open Datasets | Yes | Based on our Pancap Engine, we contribute a new SA-Pancap benchmark for the panoptic captioning task. We select SA-1B [101] as the data source due to its high image quality and data diversity. |
| Dataset Splits | Yes | Overall, our SA-Pancap benchmark consists of 9,000 training and 500 validation images paired with auto-generated panoptic captions, and 130 test images paired with human-curated panoptic captions. |
| Hardware Specification | Yes | All experiments are implemented using 4 NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions specific models like Qwen2.5-14B [106] for semantic content extraction and question answering, and that their model adopts the LLaVA architecture [34] and is initialized with ASMv2-13B [85]. It also mentions using LoRA and AdamW. However, it does not provide specific version numbers for general software dependencies like Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | Our model adopts the general LLaVA architecture [34], and it is initialized using the pre-trained ASMv2-13B [85] checkpoint... We finetune our model using LoRA (rank r = 128 and α = 256), and optimize using AdamW (batch size 128, learning rate 2e-4). During inference, our model employs greedy decoding for caption generation. We train our model on the training set of our SA-Pancap for two epochs, and conduct evaluation on the validation and test sets. For our Pancap Score metric, we use Qwen2.5-14B [106] as the LLM for semantic content extraction and question answering. The threshold for measuring semantic consistency is set as δ_t = 0.5, the threshold for location consistency is set as δ_l = 0.5, and the weight coefficient λ_g is set as 0.1. |
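
The reported setup and dataset splits can be collected into a single record for anyone attempting a reproduction. The following is a minimal sketch, not the authors' code: the class name `PancapSetup`, its field names, and the helper methods are hypothetical and chosen only for illustration. The values are those quoted in the table. The step count assumes one training sample per image and that a final partial batch is kept, neither of which the paper excerpt confirms.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PancapSetup:
    """Hypothetical container for the hyperparameters quoted in the table."""

    # Model initialization and fine-tuning (values as reported)
    base_checkpoint: str = "ASMv2-13B"
    lora_rank: int = 128
    lora_alpha: int = 256
    optimizer: str = "AdamW"
    batch_size: int = 128
    learning_rate: float = 2e-4
    epochs: int = 2
    decoding: str = "greedy"

    # Pancap Score settings (values as reported)
    judge_llm: str = "Qwen2.5-14B"
    semantic_threshold: float = 0.5   # delta_t
    location_threshold: float = 0.5   # delta_l
    lambda_g: float = 0.1

    # SA-Pancap split sizes, in images (values as reported)
    train_images: int = 9000
    val_images: int = 500
    test_images: int = 130

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scaling factor alpha / r applied to the low-rank update.
        return self.lora_alpha / self.lora_rank

    def steps_per_epoch(self) -> int:
        # Assumption: one sample per image, last partial batch not dropped.
        return math.ceil(self.train_images / self.batch_size)

    def total_images(self) -> int:
        return self.train_images + self.val_images + self.test_images


if __name__ == "__main__":
    setup = PancapSetup()
    print(f"LoRA scaling (alpha / r): {setup.lora_scaling}")
    print(f"Optimizer steps per epoch (assumed): {setup.steps_per_epoch()}")
    print(f"Total optimizer steps (assumed): {setup.steps_per_epoch() * setup.epochs}")
    print(f"Images across all splits: {setup.total_images()}")
```

Items the table marks as unreported, such as Python and PyTorch versions, are deliberately left out of the record rather than guessed.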