Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

CAT: Content-Adaptive Image Tokenization

Authors: Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train CAT on a diverse set of images using LLM-evaluated compression ratios and conduct extensive evaluations across nine datasets, covering natural scenes (ImageNet [13], COCO [14]), human faces (CelebA [15]), and detail-heavy domains such as text (ChartQA [16], GTSRB [17], SVHN [18]), textures, and satellite imagery. On large-scale natural images, CAT preserves high reconstruction quality while reducing token usage by 18% compared to fixed-token baselines. On complex images, CAT achieves significantly better reconstruction quality, improving the rFID by 12% on CelebA, 17% on GTSRB, 20% on SVHN, and 39% on ChartQA relative to fixed-token baselines. We also benchmark CAT on two critical downstream tasks:
Researcher Affiliation | Collaboration | Junhong Shen (Carnegie Mellon University); Kushal Tirumala (FAIR at Meta); Michihiro Yasunaga (FAIR at Meta); Ishan Misra (FAIR at Meta); Luke Zettlemoyer (FAIR at Meta); Lili Yu (FAIR at Meta); Chunting Zhou (FAIR at Meta)
Pseudocode | No | The paper describes the methodology in Sections 3.2 'Complexity Evaluation via Captions and LLMs' and 3.3 'Nested VAE for Adaptive Compression' using descriptive text and architectural diagrams such as Figure 1, but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | We provide our training code in the supplementary material. ... We use open-source benchmarks and release our code.
Open Datasets | Yes | We train CAT on a diverse set of images using LLM-evaluated compression ratios and conduct extensive evaluations across nine datasets, covering natural scenes (ImageNet [13], COCO [14]), human faces (CelebA [15]), and detail-heavy domains such as text (ChartQA [16], GTSRB [17], SVHN [18]), textures, and satellite imagery.
Dataset Splits | Yes | Evaluation datasets. We use four representative datasets: COCO [14] and ImageNet [13] for natural images, CelebA [15] and ChartQA [16] for perceptually challenging images. Table 1 shows the compression ratio distributions. ... We compute their reconstruction mean squared error (MSE) on 41K COCO 2014 [14] images with resolution 512.
Hardware Specification | Yes | All models including the baselines are trained using a global batch size of 512 on 64 NVIDIA A100 GPUs for 1M steps. ... All models are trained on 16 NVIDIA H100 GPUs for 400K steps, using a global token batch size of 262,144, which is equivalent to 1024 images at 16x compression.
Software Dependencies | No | We implement the nested VAE similar to the AutoencoderKL implementation in the diffusers library. This mentions a library but does not provide a specific version number.
Experiment Setup | Yes | We use the following training configuration: GPU: 64 NVIDIA A100; Per-GPU batch size: 8; Global batch size: 512; Training steps: 1,000,000; Optimizer: AdamW (lr: 0.0001, beta1: 0.9, beta2: 0.95, weight_decay: 0.1, epsilon: 1e-8, gradient_clip: 5.0); Scheduler: constant with 10,000 warmup steps; Loss: recon_loss_weight: 1.0, kl_loss_weight: 1e-6, perceptual_loss_weight: 1.0, moco_loss_weight: 0.2, gan_loss_weight: 0.5, gan_loss_starting_step: 50,000
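
For readers who want to sanity-check the reported setup, the values in the Experiment Setup and Hardware Specification rows can be written out as a small Python sketch. This is an illustration, not the authors' code: the dictionary layout and the helper names (`learning_rate`, `gan_weight`, `check_batch_sizes`, `tokens_per_image`) are hypothetical, the warmup is assumed to be a linear ramp (the row only says "constant with 10,000 warmup steps"), and the GAN loss is assumed to switch on abruptly at its starting step.

```python
# Illustrative sketch only. Numeric values are copied from the table rows above;
# the structure and helper names are assumptions, not the authors' implementation.

TRAIN_CONFIG = {
    "num_gpus": 64,                 # 64 NVIDIA A100
    "per_gpu_batch_size": 8,
    "global_batch_size": 512,
    "training_steps": 1_000_000,
    "optimizer": {
        "name": "AdamW",
        "lr": 1e-4,
        "beta1": 0.9,
        "beta2": 0.95,
        "weight_decay": 0.1,
        "epsilon": 1e-8,
        "gradient_clip": 5.0,
    },
    "scheduler": {"type": "constant", "warmup_steps": 10_000},
    "loss": {
        "recon_loss_weight": 1.0,
        "kl_loss_weight": 1e-6,
        "perceptual_loss_weight": 1.0,
        "moco_loss_weight": 0.2,
        "gan_loss_weight": 0.5,
        "gan_loss_starting_step": 50_000,
    },
}


def learning_rate(step, config=TRAIN_CONFIG):
    """Constant schedule with warmup. A linear ramp is ASSUMED; the source does not state the shape."""
    base_lr = config["optimizer"]["lr"]
    warmup = config["scheduler"]["warmup_steps"]
    if step < warmup:
        return base_lr * (step + 1) / warmup
    return base_lr


def gan_weight(step, config=TRAIN_CONFIG):
    """GAN loss weight at a given step. A hard switch at the starting step is ASSUMED."""
    loss = config["loss"]
    if step >= loss["gan_loss_starting_step"]:
        return loss["gan_loss_weight"]
    return 0.0


def check_batch_sizes(config=TRAIN_CONFIG):
    """Verify that GPUs x per-GPU batch size matches the reported global batch size."""
    return config["num_gpus"] * config["per_gpu_batch_size"] == config["global_batch_size"]


def tokens_per_image(global_token_batch=262_144, images=1024):
    """Tokens per image implied by the Hardware Specification row (262,144 tokens = 1024 images)."""
    return global_token_batch // images


if __name__ == "__main__":
    print("batch sizes consistent:", check_batch_sizes())
    print("tokens per image at 16x compression:", tokens_per_image())
```

The two consistency checks reproduce arithmetic already implied by the table: 64 GPUs with 8 images each give the global batch size of 512, and 262,144 tokens spread over 1024 images give 256 tokens per image. That figure would match a 16 by 16 latent grid, which at 16x compression per side suggests 256-pixel inputs; the resolution is an inference and is not stated in the row.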