Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation

Authors: Lingfeng Wang, Hualing Lin, Senda Chen, Tao Wang, Changxu Cheng, Yangyang Zhong, Dong Zheng, Wuyue Zhao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models are released at https://github.com/yayafengzi/ALToLLM.
Researcher Affiliation | Collaboration | 1 Uni-Ubi, 2 Zhejiang University, 3 Tongji University. Equal contributions. Corresponding author. Work done during internship at Uni-Ubi. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes mathematical details and training insights in Appendix A, including equations for loss functions and gradients, but it does not present a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The code and data are not released at submission time but will be made available in the future.
Open Datasets | Yes | For stages 1 and 1.5, we construct the training and validation sets of Multi-Target-SA1B from the SA1B dataset... For stage 3, we use Multi-Target-SA1B, the RefCOCO series [48, 49], and gRefCOCO [50]... including ADE20K (A-150) [62], PASCAL Context59 (PC-59) [63], and PASCAL VOC 20 (PAS-20) [64]
Dataset Splits | Yes | For stages 1 and 1.5, we construct the training and validation sets of Multi-Target-SA1B from the SA1B dataset by randomly selecting multiple masks from all annotations for each image. This approach yields complex multi-target masks, facilitating the learning of expressive mask representations by ALTo. For stage 2, we used all HiMTok and Multi-Target-SA1B datasets for SFT. For Multi-Target-SA1B, we input the bounding boxes of all targets as <box>[[],[],...]</box>. To ensure that the model supports both fixed-length and adaptive-length prompts, we randomly assign half of the data to each prompt type, as detailed in Appendix B. For stage 3, we use Multi-Target-SA1B, the RefCOCO series [48, 49], and gRefCOCO [50] to maintain complex mask representation and language understanding during RL.
Hardware Specification | Yes | Stages 1, 1.5, and 3 are trained on 8 A100 GPUs (80GB each), and stage 2 on 16 A100 GPUs.
Software Dependencies | No | The paper does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | ALTo processes input and reconstructs masks at 256 × 256 resolution. During training and inference, the MLLM processes images at 448 × 448, while the pixel encoder encodes image at 1024 × 1024. In stage 1.5, the feature dimension of TLP is set to 1024, consistent with MT. The length penalty coefficient is set to 0.0001, 0.001, 0.01, or 0.1, among which 0.01 is found to be optimal in subsequent experiments and is chosen for later stages. Stage 3 trains the RL model based on the stage 2 checkpoint, with the length penalty set to 1e-2, 5e-3, 3e-3, 2e-3, 1e-3, or 1e-4, which are compared in later experiments. The KL penalty is set to 1e-3. We sample 12 group responses with a temperature of 1 and top-k of 10.
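
The stage 3 sampling settings in the Experiment Setup row (12 group responses, temperature 1, top-k 10) can be illustrated with a minimal Python sketch. This is a generic example of temperature and top-k sampling, not the authors' implementation: the names `Stage3Settings`, `sample_top_k`, and `sample_group` are hypothetical, and for brevity the sketch draws 12 single-token samples from one logit vector, whereas a real pipeline would presumably sample 12 full responses.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage3Settings:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    length_penalties: tuple = (1e-2, 5e-3, 3e-3, 2e-3, 1e-3, 1e-4)
    kl_penalty: float = 1e-3
    group_size: int = 12
    temperature: float = 1.0
    top_k: int = 10


def sample_top_k(logits, top_k, temperature, rng):
    """Draw one index from the top_k highest logits after temperature scaling."""
    if top_k < 1 or temperature <= 0:
        raise ValueError("top_k must be >= 1 and temperature must be > 0")
    # Keep the indices of the top_k largest logits.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in kept]
    # Softmax weights over the kept logits, shifted by the max for stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


def sample_group(logits, settings, rng):
    """Draw settings.group_size independent samples for one decoding step."""
    return [
        sample_top_k(logits, settings.top_k, settings.temperature, rng)
        for _ in range(settings.group_size)
    ]


if __name__ == "__main__":
    settings = Stage3Settings()
    rng = random.Random(0)  # fixed seed so repeated runs match
    toy_logits = [float(i) for i in range(50)]  # made-up logits for illustration
    print(sample_group(toy_logits, settings, rng))
```

With top-k set to 10, every sampled index comes from the ten highest-scoring entries, so lower-ranked tokens are never drawn regardless of temperature.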