Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Authors: Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng YAN
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed MLLM (SETOKIM) equipped with SeTok demonstrates significantly superior performance across various tasks, as evidenced by our experimental results. The project page is https://sqwu.top/SeTok-web/. |
| Researcher Affiliation | Collaboration | ¹National University of Singapore, ²ByteDance Seed, ³Nanyang Technological University, ⁴Skywork AI |
| Pseudocode | Yes | The formal token clustering algorithm is described in Algorithm 1. Specifically, a scope z ∈ [0, 1]^{h×w} is initialized to a matrix of ones 1_{h×w} to track the degree to which visual embeddings have been assigned to clusters. In addition, the seed scores are initialized by combining the local density in Eq. (1) and the distance in Eq. (2) to perform the selection of visual embeddings. At each iteration, a single embedding vector x_{i,j} is selected at the spatial location (i, j) corresponding to the argmax of the element-wise multiplication of the seed scores and the current scope. This ensures that cluster seeds are sampled from pixel embeddings that have not yet been assigned to clusters. An alpha mask α_c ∈ [0, 1]^{h×w} is computed from the distance between the cluster seed embedding x_{i,j} and all individual pixel embeddings according to a distance kernel φ. The output of the kernel φ is one if two embeddings are identical and decreases to zero as the distance between a pair of embeddings increases. Additionally, a negative penalty βs is applied to the alpha mask using the seed scores, where β is a hyper-parameter. This encourages the selection of elements similar to the current feature with lower information density. The associated concept mask M_c is obtained by the element-wise multiplication of the alpha mask by the current scope. An element-wise multiplication with the complement of the alpha mask then updates the scope. This process is repeated until a stopping condition is satisfied, at which point the final scope is added as an additional mask to explain any remaining embeddings. Algorithm 1 (Token Clustering) — Require: visual embeddings X ∈ R^{h×w×d}; Ensure: masks M ∈ [0, 1]^{h×w×C} with Σ_c M_{i,j,c} = 1. 1: Initialize masks M = ∅, scope z = 1_{h×w}, seed scores s ∈ R^{h×w}; 2: while not StopCondition(M) do; 3: (i, j) = argmax(z ⊙ s); 4: α = sigmoid(φ(X, (i, j)) − βs); 5: M.append(z ⊙ α); 6: z = z ⊙ (1 − α); 7: end while; 8: M.append(z) |
| Open Source Code | No | The project page is https://sqwu.top/SeTok-web/. |
| Open Datasets | Yes | We use ImageNet-1K (Deng et al., 2009) for reconstruction learning and Open Images (Kuznetsova et al., 2020) for both reconstruction and alignment learning. ... English text corpus from the SlimPajama (Soboleva et al., 2023) dataset ... MSCOCO (Lin et al., 2014) ... ALLaVA (Chen et al., 2024) and LLaVA-665K (Liu et al., 2023c) ... VQAv2 (Goyal et al., 2019), GQA (Hudson & Manning, 2019), OK-VQA (Marino et al., 2019), A-OKVQA (Schwenk et al., 2022) ... LAION-aesthetics (Schuhmann et al., 2022) ... InstructPix2Pix (Brooks et al., 2023) and MagicBrush (Zhang et al., 2024c). ... Flickr30K (Young et al., 2014) ... refCOCOg (Mao et al., 2016), refCOCO+ (Yu et al., 2016), and ReasonSeg (Lai et al., 2024). |
| Dataset Splits | Yes | We evaluate SETOKIM on various common vision-language tasks, including visual question answering, image generation & editing, and referring segmentation. Our results reveal that semantic-equivalent tokenization significantly enhances vision-language learning compared to standard patch-level tokenization or learnable queries, achieving higher performance on various tasks. ... For examining visual understanding ability, we evaluate our model on Flickr30K (Young et al., 2014), VQAv2 (Goyal et al., 2019), GQA (Hudson & Manning, 2019), OK-VQA (Marino et al., 2019), as well as three MLLM benchmarks, i.e., POPE (Li et al., 2023), MME (Fu et al., 2023) and MM-Vet (Yu et al., 2023b). Besides, we evaluate the visual generation fidelity on the MSCOCO (Lin et al., 2014) dataset. Following Pan et al. (2024), we evaluate the image editing capabilities of SETOKIM on MagicBrush (Zhang et al., 2024c), EVR (Tan et al., 2019) and MA5K (Shi et al., 2021). Furthermore, refCOCOg (Mao et al., 2016), refCOCO+ (Yu et al., 2016), and ReasonSeg (Lai et al., 2024) are utilized to examine the potential referring segmentation capabilities of the proposed model. |
| Hardware Specification | Yes | All training is conducted on 64 H100 (80G) GPUs. |
| Software Dependencies | No | No specific software versions (e.g., Python, PyTorch, CUDA versions) are mentioned. |
| Experiment Setup | Yes | Table 9: Training recipes (SeTok / Stage-I: Multimodal Pretraining / Stage-II: End-to-end Instruction Tuning). Optimizer: AdamW / AdamW / AdamW; Precision: bfloat16 / bfloat16 / bfloat16; Peak learning rate of LLM: — / 5e-5 / 5e-5; Peak learning rate of visual part: 5e-4 / 1e-4 / 2e-4; Weight decay: 0.05 / 0.1 / 0.01; Learning rate scheduler: Cosine / Cosine / Cosine; LR warmup steps: 10K / 2K / 5K; Input image resolution: 384×384 / 384×384 / 384×384; Batch size per GPU: 16 / 16 / 16; Gradient accumulation steps: 8 / 8 / 8; Maximum token length: — / 2048 / 2048 |
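The greedy loop of Algorithm 1 can be sketched in a few lines of NumPy. This is a hedged re-implementation, not the authors' code: the Gaussian kernel for φ, the stop condition (scope nearly exhausted or a cluster cap reached), and the values of `beta`, `max_clusters`, and `tau` are all illustrative assumptions; the paper defines the seed scores via its Eq. (1) and Eq. (2), which are taken here as a precomputed nonnegative input.

```python
import numpy as np

def token_clustering(X, seed_scores, beta=0.1, max_clusters=8, tau=0.05):
    """Sketch of the greedy token clustering loop (Algorithm 1).

    X           : (h, w, d) visual embeddings.
    seed_scores : (h, w) nonnegative seed scores (assumed precomputed
                  from the paper's Eq. (1)/(2)).
    beta, max_clusters, tau : illustrative hyper-parameters, not the paper's.
    Returns masks M of shape (h, w, C) with sum_c M[i, j, c] == 1.
    """
    h, w, d = X.shape
    scope = np.ones((h, w))  # degree to which each embedding is still unassigned
    masks = []
    for _ in range(max_clusters):
        if scope.sum() < tau * h * w:      # hypothetical stop condition
            break
        # Seed = unassigned embedding with the highest seed score (line 3).
        i, j = np.unravel_index(np.argmax(scope * seed_scores), (h, w))
        seed = X[i, j]
        # Distance kernel phi: 1 for identical embeddings, decays to 0
        # with distance (Gaussian kernel chosen here as an assumption).
        phi = np.exp(-((X - seed) ** 2).sum(axis=-1))
        # Alpha mask with the negative seed-score penalty beta*s (line 4).
        alpha = 1.0 / (1.0 + np.exp(-(phi - beta * seed_scores)))
        masks.append(scope * alpha)        # concept mask M_c (line 5)
        scope = scope * (1.0 - alpha)      # shrink the scope (line 6)
    masks.append(scope)                    # residual mask for leftovers (line 8)
    # The telescoping construction makes the masks sum to 1 per location,
    # matching the algorithm's Ensure clause without extra normalisation.
    return np.stack(masks, axis=-1)
```

Note the design choice the algorithm relies on: because each concept mask is `scope * alpha` and the scope is then multiplied by `1 - alpha`, the masks plus the final residual scope telescope back to the all-ones matrix, so the per-pixel partition constraint holds by construction.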
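The Table 9 recipe (AdamW, separate peak learning rates for the LLM and the visual part, cosine schedule with warmup) maps onto a standard PyTorch setup. A minimal sketch using the Stage-I values; the parameter-name split on `"visual"` and the linear-warmup shape are assumptions, since the paper specifies neither:

```python
import math
import torch

def build_optimizer_and_scheduler(model, total_steps,
                                  lr_llm=5e-5, lr_visual=1e-4,
                                  weight_decay=0.1, warmup_steps=2000):
    """Stage-I style recipe sketch: AdamW with per-module peak LRs and a
    linear-warmup-then-cosine schedule (warmup shape is an assumption)."""
    # Hypothetical split: any parameter whose name contains "visual" gets
    # the visual-part learning rate; everything else gets the LLM rate.
    visual = [p for n, p in model.named_parameters() if "visual" in n]
    llm = [p for n, p in model.named_parameters() if "visual" not in n]
    optimizer = torch.optim.AdamW(
        [{"params": llm, "lr": lr_llm},
         {"params": visual, "lr": lr_visual}],
        weight_decay=weight_decay)

    def lr_lambda(step):
        if step < warmup_steps:                      # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

With 16 samples per GPU, 8 gradient-accumulation steps, and 64 GPUs, the recipe implies an effective batch size of 16 × 8 × 64 = 8192 samples per optimizer step.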