Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

TRoVe: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models

Authors: Maya Varma, Jean-Benoit Delbrouck, Sophie Ostmeier, Akshay Chaudhari, Curtis Langlotz

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use this framework to demonstrate that TROVE can accurately identify error-inducing static feature biases in VLMs, achieving a 28.6% improvement over the closest baseline. Finally, we apply TROVE to 7 off-the-shelf VLMs and 2 temporal understanding tasks, surfacing previously-unknown static feature biases and demonstrating that knowledge of learned biases can aid in improving model performance at test time.
Researcher Affiliation | Collaboration | (1) Stanford University; (2) HOPPR
Pseudocode | No | The paper describes the methodology in prose and mathematical formulas in Section 3, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/Stanford-AIMI/TRoVe.
Open Datasets | Yes | Kinetics400 [40]... Kinetics400 is open-source. Pneumonia Progression Classification on MS-CXR-T [8]... MS-CXR-T is available under PhysioNet Credentialed Health Data License 1.5.0. 600-class activity recognition on Kinetics600 [51]... 174-class fine-grained activity recognition on Something Something V2 [52].
Dataset Splits | No | We split the validation set of Kinetics400 into a development set (used for the analysis in this section) and a test set (used for the analysis in Section 5.2 where we mitigate biases). The paper mentions splitting the validation set but does not provide specific percentages, sample counts, or references to predefined splits for these custom sets.
Hardware Specification | Yes | Training is performed on a single NVIDIA V100 GPU using a batch size of 256, an initial learning rate of 1e-4, and a total of 100 epochs with early stopping based on validation set performance. ... We implement TROVE using a single NVIDIA V100 GPU.
Software Dependencies | No | Model F is implemented in the form of a simple contrastive VLM where the vision and text encoders are based on the CLIP ViT-L/14 architecture. The paper mentions a model architecture (CLIP ViT-L/14) but does not provide specific version numbers for software libraries, programming languages, or other key software components used for implementation.
Experiment Setup | Yes | Training is performed on a single NVIDIA V100 GPU using a batch size of 256, an initial learning rate of 1e-4, and a total of 100 epochs with early stopping based on validation set performance. ... The optimal number of clusters is selected automatically by sweeping across a range of potential values [|Y| × 2, |Y| × 6) at increments of 400; here, the bounds of the range evaluate to [800, 2400), given the fact that |Y| = 400. ... For each considered cluster C, we utilize an SGD optimizer with a learning rate of 0.002 and train for 20 epochs.
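
The cluster-count sweep quoted under Experiment Setup can be checked arithmetically. The following is a minimal sketch, assuming the bounds are the class count |Y| multiplied by 2 and by 6, with a half-open range and a step of 400 as quoted. The function name `candidate_cluster_counts` and its parameter names are hypothetical illustrations and are not taken from the TRoVe code.

```python
def candidate_cluster_counts(num_classes, low_mult=2, high_mult=6, step=400):
    """Enumerate candidate cluster counts in the half-open range
    [num_classes * low_mult, num_classes * high_mult), spaced by `step`.

    All names and defaults are illustrative assumptions; only the
    bounds and the step size come from the quoted setup.
    """
    lower = num_classes * low_mult   # 800 when num_classes = 400
    upper = num_classes * high_mult  # 2400 when num_classes = 400 (excluded)
    return list(range(lower, upper, step))


if __name__ == "__main__":
    # Kinetics400 has |Y| = 400 classes.
    print(candidate_cluster_counts(400))
```

With |Y| = 400 this yields four candidates (800, 1200, 1600, 2000); the upper bound 2400 is excluded because the range is half-open. How the best candidate is then chosen is not specified in the quoted text, so it is left out of the sketch.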