Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation

Authors: Mehrdad Noori, David Osowiechi, Gustavo Vargas Hakim, Ali Bahri, Moslem Yazdanpanah, Sahar Dastani, Farzad Beizaee, Ismail Ayed, Christian Desrosiers

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments on this suite demonstrate that our segmentation-tailored method consistently delivers significant gains over direct adoption of TTA classification baselines. Code and data are available at https://github.com/dosowiechi/MLMP.
Researcher Affiliation	Academia	LIVIA, ÉTS Montréal, Canada; International Laboratory on Learning Systems (ILLS). Correspondence to EMAIL and EMAIL.
Pseudocode	No	The paper includes a figure (Figure 2: Overview of our MLMP method) that illustrates the pipeline, but it does not contain structured pseudocode or algorithm blocks. The methodology is described in text and mathematical equations.
Open Source Code	Yes	Code and data are available at https://github.com/dosowiechi/MLMP.
Open Datasets	Yes	Our experiments are conducted on Pascal VOC 20 (v20), Pascal VOC 21 (v21) [40], Pascal Context 59 (P59), Pascal Context 60 (P60) [41], and Cityscapes [42], incorporating both original version (clean) and the synthetic 15 corruptions (denoted with a -C suffix). For COCO-Stuff [43] and COCO-Object [44], we use only the original versions. To further evaluate robustness under real and rendered distributional shifts, we additionally include ACDC [45], capturing real-world adverse conditions such as fog, night, rain, and snow, and GTA-V [46], which provides photorealistic, game-rendered urban scenes.
Dataset Splits	No	The paper describes using standard datasets and applying corruptions to create test scenarios (e.g., "87 distinct test scenarios"). It mentions processing images for evaluation (e.g., resizing to 224x224, splitting Cityscapes images into patches), but it does not explicitly provide the train/test/validation splits used for the original model training or specific partitioning details for their evaluation beyond using original/corrupted versions of datasets.
Hardware Specification	Yes	All experiments are conducted on NVIDIA V100 GPUs equipped with 32GB memory.
Software Dependencies	No	We implement our approach using the PyTorch deep learning framework.
Experiment Setup	Yes	The adaptation process is carried out over 10 iterations using the Adam optimizer with a constant learning rate of 10^-3 across all datasets. We use a batch size of 2 images during adaptation across all datasets.
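
The "87 distinct test scenarios" quoted in the Dataset Splits row can be checked against the datasets listed in the Open Datasets row. The sketch below is one plausible tally, not a breakdown stated on this page: it assumes each clean or corrupted dataset variant counts as one scenario, that ACDC contributes one scenario per adverse condition (fog, night, rain, snow), and that GTA-V counts once. The scenario labels and the `enumerate_scenarios` helper are hypothetical names used only for illustration.

```python
# Datasets evaluated in both clean and corrupted (-C) form, per the Open Datasets row.
CORRUPTIBLE = ["v20", "v21", "P59", "P60", "Cityscapes"]
# Datasets evaluated only in their original form.
CLEAN_ONLY = ["COCO-Stuff", "COCO-Object"]
# Assumption: ACDC contributes one scenario per adverse condition.
ACDC_CONDITIONS = ["fog", "night", "rain", "snow"]
# The row mentions 15 synthetic corruptions; their names are not given here,
# so they are represented by placeholder indices.
NUM_CORRUPTIONS = 15


def enumerate_scenarios():
    """Return one label per assumed test scenario."""
    scenarios = []
    for name in CORRUPTIBLE:
        scenarios.append(f"{name} (clean)")
        scenarios.extend(
            f"{name}-C/corruption_{i:02d}" for i in range(1, NUM_CORRUPTIONS + 1)
        )
    scenarios.extend(f"{name} (clean)" for name in CLEAN_ONLY)
    scenarios.extend(f"ACDC/{condition}" for condition in ACDC_CONDITIONS)
    scenarios.append("GTA-V")
    return scenarios


scenarios = enumerate_scenarios()
# 5 datasets * (1 clean + 15 corrupted) + 2 clean-only + 4 ACDC + 1 GTA-V
print(len(scenarios))
```

Under these assumptions the tally comes to 5 × 16 + 2 + 4 + 1 = 87, which matches the quoted figure; a different grouping of ACDC or GTA-V would give a different count.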