Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Vision‑Language‑Vision Auto‑Encoder: Scalable Knowledge Distillation from Diffusion Models
Authors: Tiezheng Zhang, Yitong Li, Yu-Cheng Chou, Jieneng Chen, Alan L. Yuille, Chen Wei, Junfei Xiao
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experimental results validate that the proposed captioner exhibits highly competitive captioning performance relative to SoTA VLMs, such as GPT-4o, and surpasses other open-source models of comparable parameter counts. Additionally, we explore emergent properties of the proposed VLV autoencoder: a) semantic richness, where learned embeddings encode detailed semantic aspects, including object 3D pose and orientation, resulting in robust spatial consistency; and b) compositional generalization, achieved by concatenating caption embeddings from distinct images, allowing the model to disentangle foreground objects from backgrounds effectively and compose novel, coherent, and visually plausible images. Section 4, titled "Experiment," details the experimental setup and results, including quantitative results on text-to-image generation, human studies of caption quality, visual-question-answering (VQA) benchmarks, and ablation studies. |
| Researcher Affiliation | Academia | 1Johns Hopkins University 2Tsinghua University 3Rice University |
| Pseudocode | No | The paper describes the methodology using textual descriptions and architectural diagrams (Figure 2) but does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The dataset is partially open-sourced, and we will release the implementation code upon the paper's acceptance. |
| Open Datasets | Yes | Data Collection. From LAION-2B-en-aesthetic, a subset of LAION-5B [56], we curate a 40M image subset. ... We assess caption quality by feeding each decoded caption to Stable Diffusion 3.5 Medium [20] and computing the Fréchet Inception Distance (FID) [25] between the synthesized and original images on 30K samples from the MS-COCO 2014 validation split [14]. ... we assess their effectiveness on open-ended vision language tasks using VQAv2 [23] and OK-VQA [46] validation sets. Section F.1 Training Datasets states: License: Creative Common CC-BY 4.0 https://laion.ai/blog/laion-5b/. Section F.2 Testing Datasets states: License: Creative Common CC-BY 4.0 https://cocodataset.org/#termsofuse, License: CC-BY 4.0 https://visualqa.org/terms.html, Dataset website: https://visualqa.org/index.htmll, License:N/A. Dataset website: https://okvqa.allenai.org/ |
| Dataset Splits | Yes | We assess caption quality by feeding each decoded caption to Stable Diffusion 3.5 Medium [20] and computing the Fréchet Inception Distance (FID) [25] between the synthesized and original images on 30K samples from the MS-COCO 2014 validation split [14]. ... we assess their effectiveness on open-ended vision language tasks using VQAv2 [23] and OK-VQA [46] validation sets. |
| Hardware Specification | Yes | Training runs for 200K steps with batch size 512 on 8 RTX™ 6000 Ada GPUs (4 days). |
| Software Dependencies | No | We use AdamW [44] optimizer with (β1, β2) = (0.9, 0.99) and a decoupled weight decay of 0.01. ... We use FP32 in autoencoder training to make models converge with stability, while the LLM decoder training uses BF16. The paper mentions Qwen-2.5 as a pretrained model for initializing the LLM decoder, and Florence-2 pretrained weights for the image encoder part, but does not specify versions for underlying software libraries (e.g., PyTorch, TensorFlow) or general programming language versions. |
| Experiment Setup | Yes | Training Details. When training our VLV auto-encoder, we initialize the image encoder part with Florence-2 [72] pretrained weights. The additional N_q = 77 learnable queries are randomly initialized. We use AdamW [44] optimizer with (β1, β2) = (0.9, 0.99) and a decoupled weight decay of 0.01. Training runs for 200K steps with batch size 512 on 8 RTX™ 6000 Ada GPUs (4 days). The learning rate starts at 5e-5 and follows a cosine schedule [43]. We use Qwen-2.5 [77] pretrained models for initializing the LLM decoder. We train the captioning decoder with 100K steps, having the batch size of 64. The learning rate decays linearly starting at 1e-5. We use FP32 in autoencoder training to make models converge with stability, while the LLM decoder training uses BF16. |
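
The hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration sketch, which may help when checking a re-implementation against the reported values. This is not the authors' code. The names `StageConfig`, `lr_at`, `samples_seen`, and `final_lr` are hypothetical, and the schedules assume no warmup and a decay to zero, neither of which is stated in the quoted text. The AdamW settings are quoted in the context of the auto-encoder stage; whether they also apply to the captioning decoder is not stated.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage, as quoted in the table."""
    total_steps: int
    batch_size: int
    base_lr: float
    schedule: str   # "cosine" or "linear"
    precision: str  # "fp32" or "bf16"


# Values quoted in the Experiment Setup row.
AUTOENCODER = StageConfig(200_000, 512, 5e-5, "cosine", "fp32")
CAPTION_DECODER = StageConfig(100_000, 64, 1e-5, "linear", "bf16")

# AdamW settings quoted for auto-encoder training.
ADAMW = {"betas": (0.9, 0.99), "weight_decay": 0.01}


def lr_at(cfg: StageConfig, step: int, final_lr: float = 0.0) -> float:
    """Learning rate at a given step.

    ASSUMPTION: no warmup and decay to `final_lr` (default 0.0);
    the quoted text specifies only the starting rate and schedule shape.
    """
    progress = min(max(step / cfg.total_steps, 0.0), 1.0)
    if cfg.schedule == "cosine":
        scale = 0.5 * (1.0 + math.cos(math.pi * progress))
        return final_lr + (cfg.base_lr - final_lr) * scale
    if cfg.schedule == "linear":
        return cfg.base_lr + (final_lr - cfg.base_lr) * progress
    raise ValueError(f"unknown schedule: {cfg.schedule}")


def samples_seen(cfg: StageConfig) -> int:
    """Total training samples processed: steps times batch size."""
    return cfg.total_steps * cfg.batch_size
```

Under these assumptions, `lr_at(AUTOENCODER, 100_000)` is about 2.5e-5 (half the starting rate at the midpoint of the cosine schedule), and `samples_seen(AUTOENCODER)` is 102,400,000.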