Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Universal Visuo-Tactile Video Understanding for Embodied Interaction

Authors: Yifan Xie, Mingyang Li, Shoujie Li, Xingting Li, Guangyu Chen, Fei Ma, Fei Richard Yu, Wenbo Ding

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental evaluations demonstrate that VTV-LLM achieves superior performance in tactile video understanding tasks, establishing a foundation for more intuitive human-machine interaction in tactile domains. 4 Experiments Our experiments utilize the proposed VTV150K dataset for both training and evaluation protocols.
Researcher Affiliation | Academia | Yifan Xie1, Mingyang Li1, Shoujie Li1, Xingting Li1, Guangyu Chen2, Fei Ma3, Fei Yu3, Wenbo Ding1 1 Tsinghua Shenzhen International Graduate School, Tsinghua University 2 Shenzhen Campus of Sun Yat-sen University 3 Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Pseudocode | No | The paper describes methods and training paradigms in prose and with diagrams (e.g., Figure 2 and Figure 3) but does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Although no link is available for our work now, we are strongly intended to share our work in public when the paper is accepted.
Open Datasets | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Although no link is available for our work now, we are strongly intended to share our work in public when the paper is accepted. The paper introduces VTV150K, a comprehensive dataset, but concrete public access information (link, DOI, repository) is not yet provided.
Dataset Splits | Yes | For Stage 2 and 3, we utilize two independently generated sets of 10,000 question-answer pairs to prevent data leakage between stages. To evaluate model performance, we create a separate test set comprising 600 question-answer pairs for novel objects not present in the training data, ensuring comprehensive coverage across various tactile reasoning tasks.
Hardware Specification | Yes | All experiments are conducted on 4 NVIDIA RTX 6000 Ada GPUs. Additional implementation details and hyperparameter configurations are provided in the Supplementary Material ??.
Software Dependencies | No | At the core of VTV-LLM lies a (Qwen 2.5 [4, 5]) that synthesizes complex multi-modal information from visuo-tactile videos, utilizing world knowledge to generate coherent, human-readable descriptions of tactile attributes. The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | No | Our LLM backbone is based on Qwen 2.5 [4, 5], experimenting with three model variants (3B, 7B, and 14B parameters). All experiments are conducted on 4 NVIDIA RTX 6000 Ada GPUs. Additional implementation details and hyperparameter configurations are provided in the Supplementary Material ??.