Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Authors: Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng YAN
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed MLLM (SETOKIM) equipped with SeTok demonstrates significantly superior performance across various tasks, as evidenced by our experimental results. The project page is https://sqwu.top/SeTok-web/. |
| Researcher Affiliation | Collaboration | ¹National University of Singapore, ²ByteDance Seed, ³Nanyang Technological University, ⁴Skywork AI |
| Pseudocode | Yes | The formal token clustering algorithm is described in Algorithm 1. Specifically, a scope z ∈ [0, 1]^{h×w} is initialized to a matrix of ones 1_{h×w} to track the degree to which visual embeddings have been assigned to clusters. In addition, the seed scores are initialized by combining the local density in Eq. (1) and the distance in Eq. (2) to perform the selection of visual embeddings. At each iteration, a single embedding vector x_{i,j} is selected at the spatial location (i, j) corresponding to the argmax of the element-wise multiplication of the seed scores and the current scope. This ensures that cluster seeds are sampled from pixel embeddings that have not yet been assigned to clusters. An alpha mask α_c ∈ [0, 1]^{h×w} is computed from the distance between the cluster seed embedding x_{i,j} and all individual pixel embeddings according to a distance kernel φ. The output of the kernel φ is one if two embeddings are identical and decreases to zero as the distance between a pair of embeddings increases. Additionally, a negative penalty βs is applied to the alpha mask using the seed scores, where β is a hyper-parameter. This encourages the selection of elements similar to the current feature with lower information density. The associated concept mask M_c is obtained by the element-wise multiplication of the alpha mask by the current scope. An element-wise multiplication with the complement of the alpha mask then updates the scope. This process is repeated until a stopping condition is satisfied, at which point the final scope is added as an additional mask to explain any remaining embeddings. Algorithm 1 (Token Clustering) — Require: visual embeddings X ∈ R^{h×w×d}; Ensure: masks M ∈ [0, 1]^{h×w×C} with Σ_c M_{i,j,c} = 1. 1: Initialize masks M = ∅, scope z = 1_{h×w}, seed scores s ∈ R^{h×w}; 2: while not StopCondition(M) do; 3: (i, j) = argmax(z ⊙ s); 4: α = sigmoid(φ(X, (i, j)) − βs); 5: M.append(z ⊙ α); 6: z = z ⊙ (1 − α); 7: end while; 8: M.append(z) |
| Open Source Code | No | The project page is https://sqwu.top/SeTok-web/. |
| Open Datasets | Yes | We use ImageNet-1K (Deng et al., 2009) for reconstruction learning and Open Images (Kuznetsova et al., 2020) for both reconstruction and alignment learning. ... English text corpus from the SlimPajama (Soboleva et al., 2023) dataset ... MSCOCO (Lin et al., 2014) ... ALLaVA (Chen et al., 2024) and LLaVA-665K (Liu et al., 2023c) ... VQAv2 (Goyal et al., 2019), GQA (Hudson & Manning, 2019), OK-VQA (Marino et al., 2019), A-OKVQA (Schwenk et al., 2022) ... LAION-aesthetics (Schuhmann et al., 2022) ... InstructPix2Pix (Brooks et al., 2023) and MagicBrush (Zhang et al., 2024c). ... Flickr30K (Young et al., 2014) ... refCOCOg (Mao et al., 2016), refCOCO+ (Yu et al., 2016), and ReasonSeg (Lai et al., 2024). |
| Dataset Splits | Yes | We evaluate SETOKIM on various common vision-language tasks, including visual question answering, image generation & editing, and referring segmentation. Our results reveal that semantic-equivalent tokenization significantly enhances vision-language learning compared to standard patch-level tokenization or learnable queries, achieving higher performance on various tasks. ... For examining visual understanding ability, we evaluate our model on Flickr30K (Young et al., 2014), VQAv2 (Goyal et al., 2019), GQA (Hudson & Manning, 2019), OK-VQA (Marino et al., 2019), as well as three MLLM benchmarks, i.e., POPE (Li et al., 2023), MME (Fu et al., 2023) and MM-Vet (Yu et al., 2023b). Besides, we evaluate the visual generation fidelity on the MSCOCO (Lin et al., 2014) dataset. Following Pan et al. (2024), we evaluate the image editing capabilities of SETOKIM on MagicBrush (Zhang et al., 2024c), EVR (Tan et al., 2019) and MA5K (Shi et al., 2021). Furthermore, refCOCOg (Mao et al., 2016), refCOCO+ (Yu et al., 2016), and ReasonSeg (Lai et al., 2024) are utilized to examine the potential referring segmentation capabilities of the proposed model. |
| Hardware Specification | Yes | All training is conducted on 64 H100 (80G) GPUs. |
| Software Dependencies | No | No specific software versions (e.g., Python, PyTorch, CUDA versions) are mentioned. |
| Experiment Setup | Yes | Table 9: Training recipes (SeTok / Stage-I: Multimodal Pretraining / Stage-II: End-to-end Instruction Tuning). Optimizer: AdamW / AdamW / AdamW; Precision: bfloat16 / bfloat16 / bfloat16; Peak learning rate of LLM: — / 5e-5 / 5e-5; Peak learning rate of visual part: 5e-4 / 1e-4 / 2e-4; Weight decay: 0.05 / 0.1 / 0.01; Learning rate scheduler: Cosine / Cosine / Cosine; LR warmup steps: 10K / 2K / 5K; Input image resolution: 384×384 / 384×384 / 384×384; Batch size per GPU: 16 / 16 / 16; Gradient accumulation steps: 8 / 8 / 8; Maximum token length: — / 2048 / 2048 |
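The greedy loop of Algorithm 1 can be sketched in a few lines of NumPy. This is a hedged re-implementation, not the authors' code: the Gaussian kernel for φ, the stop condition (scope nearly exhausted or a cluster cap reached), and the values of `beta`, `max_clusters`, and `tau` are all illustrative assumptions; the paper defines the seed scores via its Eq. (1) and Eq. (2), which are taken here as a precomputed nonnegative input.

```python
import numpy as np

def token_clustering(X, seed_scores, beta=0.1, max_clusters=8, tau=0.05):
    """Sketch of the greedy token clustering loop (Algorithm 1).

    X           : (h, w, d) visual embeddings.
    seed_scores : (h, w) nonnegative seed scores (assumed precomputed
                  from the paper's Eq. (1)/(2)).
    beta, max_clusters, tau : illustrative hyper-parameters, not the paper's.
    Returns masks M of shape (h, w, C) with sum_c M[i, j, c] == 1.
    """
    h, w, d = X.shape
    scope = np.ones((h, w))  # degree to which each embedding is still unassigned
    masks = []
    for _ in range(max_clusters):
        if scope.sum() < tau * h * w:      # hypothetical stop condition
            break
        # Seed = unassigned embedding with the highest seed score (line 3).
        i, j = np.unravel_index(np.argmax(scope * seed_scores), (h, w))
        seed = X[i, j]
        # Distance kernel phi: 1 for identical embeddings, decays to 0
        # with distance (Gaussian kernel chosen here as an assumption).
        phi = np.exp(-((X - seed) ** 2).sum(axis=-1))
        # Alpha mask with the negative seed-score penalty beta*s (line 4).
        alpha = 1.0 / (1.0 + np.exp(-(phi - beta * seed_scores)))
        masks.append(scope * alpha)        # concept mask M_c (line 5)
        scope = scope * (1.0 - alpha)      # shrink the scope (line 6)
    masks.append(scope)                    # residual mask for leftovers (line 8)
    # The telescoping construction makes the masks sum to 1 per location,
    # matching the algorithm's Ensure clause without extra normalisation.
    return np.stack(masks, axis=-1)
```

Note the design choice the algorithm relies on: because each concept mask is `scope * alpha` and the scope is then multiplied by `1 - alpha`, the masks plus the final residual scope telescope back to the all-ones matrix, so the per-pixel partition constraint holds by construction.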
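The Table 9 recipe (AdamW, separate peak learning rates for the LLM and the visual part, cosine schedule with warmup) maps onto a standard PyTorch setup. A minimal sketch using the Stage-I values; the parameter-name split on `"visual"` and the linear-warmup shape are assumptions, since the paper specifies neither:

```python
import math
import torch

def build_optimizer_and_scheduler(model, total_steps,
                                  lr_llm=5e-5, lr_visual=1e-4,
                                  weight_decay=0.1, warmup_steps=2000):
    """Stage-I style recipe sketch: AdamW with per-module peak LRs and a
    linear-warmup-then-cosine schedule (warmup shape is an assumption)."""
    # Hypothetical split: any parameter whose name contains "visual" gets
    # the visual-part learning rate; everything else gets the LLM rate.
    visual = [p for n, p in model.named_parameters() if "visual" in n]
    llm = [p for n, p in model.named_parameters() if "visual" not in n]
    optimizer = torch.optim.AdamW(
        [{"params": llm, "lr": lr_llm},
         {"params": visual, "lr": lr_visual}],
        weight_decay=weight_decay)

    def lr_lambda(step):
        if step < warmup_steps:                      # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

With 16 samples per GPU, 8 gradient-accumulation steps, and 64 GPUs, the recipe implies an effective batch size of 16 × 8 × 64 = 8192 samples per optimizer step.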