Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Subobject-level Image Tokenization
Authors: Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, Pascale Fung
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Intrinsic evaluations across 5 datasets demonstrate that EPOC's segmentation aligns well with human annotations of both object- and part-level visual morphology, producing more monosemantic tokens and offering substantial efficiency advantages. For extrinsic evaluation, we designed a token embedding that handles arbitrary-shaped tokens, and trained VLMs with different tokenizers on 4 datasets of object recognition and detailed captioning. The results reveal that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens. |
| Researcher Affiliation | Collaboration | ¹Meta FAIR Paris; ²The Hong Kong University of Science and Technology; ³Alibaba Group; ⁴Zillow. |
| Pseudocode | No | The paper describes the EPOC method in Section 3.2.3 and visually in Figure 2, but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project website: https://github.com/ChenDelong1999/subobjects. |
| Open Datasets | Yes | Intrinsic evaluations across 5 datasets... with COCO's COCONut relabeled validation split (Deng et al., 2024) and ADE-20K (Zhou et al., 2019) validation split provide object-level annotations, and Pascal Panoptic Parts (PPP) (de Geus et al., 2021), PartImageNet++ (PIN++) (Li et al., 2024) and SA-1B (Kirillov et al., 2023) consist subobject-level annotations. ...ImageNet-1K (Deng et al., 2009)... ShareGPT4V (Chen et al., 2024)... Pixmo-cap (Deitke et al., 2024)... CLEVR-cap generated from CLEVR (Johnson et al., 2017)... |
| Dataset Splits | Yes | For COCO, PPP, PIN++, and SA1B, we randomly sample 3k images for efficient evaluation. For ADE-20K, we include all 2k samples in the validation set. ...ImageNet-1k and CLEVR provide official validation splits, we use 5k samples from them for efficiency. For Pixmo-cap, we randomly sample 5k samples as validation, and for ShareGPT-4v, we treat 5k samples randomly selected from the GPT-4V generated captions as validation split. |
| Hardware Specification | Yes | We measure throughput with a V100 (32GB) and 30 CPU cores... The training was performed on a single NVIDIA 8×A100 machine. |
| Software Dependencies | No | The paper mentions using a SegFormer-b0 model and the scikit-image library, but it does not specify version numbers for these or other key software components like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We use a two-layer MLP as the connector between embeddings and LLM. The width is 4× of LLM's hidden state dimension. We freeze the image feature extractor and do end-to-end fine-tune the small MLP projection plus the LLM. For CLEVR-cap, ImageNet-1k, ShareGPT4V, and Pixmo-cap datasets, we respectively train the model for 30, 1, 1, 3 epochs, with a batch size of 512, 256, 256, and 256. Max tokens are set to 100 for EPOC and 64 for Mask2Former tokenizer. We use AdamW with learning rate 1 × 10⁻⁴, cosine decay or constant scheduling, and 500 warmup steps. Mixed-precision (bf16) is used to accelerate training. |
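
For readers trying to reproduce the Experiment Setup row, the quoted learning-rate schedule could be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the peak learning rate (1 × 10⁻⁴), the 500 warmup steps, and the choice between cosine decay and constant scheduling come from the row above, while the linear warmup shape, the total step count, the final learning rate of zero, and the names `TrainConfig` and `lr_at` are assumptions made here for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values quoted in the Experiment Setup row.
    peak_lr: float = 1e-4
    warmup_steps: int = 500
    # Assumptions (not stated in the table): total step count and final LR.
    total_steps: int = 10_000
    min_lr: float = 0.0


def lr_at(step: int, cfg: TrainConfig, schedule: str = "cosine") -> float:
    """Learning rate at a 0-indexed optimizer step."""
    if step < cfg.warmup_steps:
        # Assumed linear warmup up to the peak learning rate.
        return cfg.peak_lr * (step + 1) / cfg.warmup_steps
    if schedule == "constant":
        # Constant scheduling: hold the peak rate after warmup.
        return cfg.peak_lr
    # Cosine decay from peak_lr to min_lr over the remaining steps.
    decay_steps = max(1, cfg.total_steps - cfg.warmup_steps)
    progress = min(1.0, (step - cfg.warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return cfg.min_lr + (cfg.peak_lr - cfg.min_lr) * cosine
```

In a PyTorch setup, a function like this could be wrapped in `torch.optim.lr_scheduler.LambdaLR` after dividing by the peak rate, since that scheduler expects a multiplicative factor; the actual total step count would depend on the dataset size, batch size, and epoch count listed in the row.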