A Touch, Vision, and Language Dataset for Multimodal Alignment

Authors: Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) tactile-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human-labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code, checkpoints and data are available on https://tactile-vlm.github.io.
Researcher Affiliation | Collaboration | ¹UC Berkeley, ²Meta AI, ³TU Dresden.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. Methods are described in prose and with diagrams, but not in a pseudocode format.
Open Source Code | Yes | Code, checkpoints and data are available on https://tactile-vlm.github.io.
Open Datasets | Yes | In this work, we present the Touch-Vision-Language (TVL) dataset, a novel dataset consisting of 44K paired vision-tactile observations, where 10% of the data are annotated by humans while the rest are labeled by GPT-4V. ... Code, checkpoints and data are available on https://tactile-vlm.github.io.
Dataset Splits | No | The paper states, "We perform a 99%-1% train-test split across both dataset components...", but it does not specify an explicit split percentage or sample count for a separate validation set, which full reproducibility would require. A "validation set" is mentioned in a Table 3 footnote, but without details on how it was derived. (A generic split sketch follows this table.)
Hardware Specification | Yes | All experiments are run on a single NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions optimizers (AdamW) and models (OpenCLIP, LLaMA2 7B) but does not provide version numbers for software dependencies such as the programming language (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or specific libraries. (An environment-logging sketch follows this table.)
Experiment Setup | Yes | optimizer: AdamW (Loshchilov & Hutter, 2017b); base learning rate: 1.5e-4; learning rate schedule: cosine decay (Loshchilov & Hutter, 2017a); batch size: 256; weight decay: 0.05; optimizer momentum: β1, β2 = 0.9, 0.95 (Chen et al., 2020); warmup epochs (Goyal et al., 2017): 10; total epochs: 200. (A training-schedule sketch follows this table.)
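
For concreteness, here is a minimal sketch of the reported 99%-1% train-test split. The paper does not describe the splitting procedure, tooling, or random seed, so the helper `train_test_split_indices` and the fixed seed below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def train_test_split_indices(n_samples: int, test_frac: float = 0.01, seed: int = 0):
    """Randomly split dataset indices into train/test sets.

    The 99%-1% ratio comes from the paper; the shuffling scheme and
    the seed are assumptions made for illustration only.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_test = max(1, round(n_samples * test_frac))
    return perm[n_test:], perm[:n_test]  # (train indices, test indices)

# Example with the dataset's reported size of ~44K paired observations.
train_idx, test_idx = train_test_split_indices(44_000)
print(len(train_idx), len(test_idx))  # 43560 440
```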
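
Because no software versions are pinned, a reproducer would need to record their own environment. The snippet below is a generic version-logging sketch; the package list (e.g., `open_clip_torch`) is a guess based on the models the paper mentions, not a dependency set confirmed by the authors.

```python
# Log the versions of likely dependencies so a run can be reproduced later.
# The package names are assumptions; the paper does not enumerate dependencies.
import sys
import importlib.metadata as md

print(f"python=={sys.version.split()[0]}")
for pkg in ["torch", "torchvision", "open_clip_torch", "transformers"]:
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```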
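
The experiment-setup row can be read as a standard PyTorch recipe. The sketch below wires the reported hyperparameters (AdamW, base learning rate 1.5e-4, β1, β2 = 0.9, 0.95, weight decay 0.05, 10 warmup epochs, cosine decay over 200 total epochs) into a per-epoch schedule. The `model` placeholder and the loop body are hypothetical, and the paper does not state whether the schedule is stepped per epoch or per iteration.

```python
import math
import torch

# Stand-in module; the actual TVL encoder architecture is described in the paper.
model = torch.nn.Linear(512, 512)

base_lr, warmup_epochs, total_epochs = 1.5e-4, 10, 200
optimizer = torch.optim.AdamW(
    model.parameters(), lr=base_lr, betas=(0.9, 0.95), weight_decay=0.05
)

def lr_at(epoch: int) -> float:
    """Linear warmup for the first 10 epochs, then cosine decay to 0 at epoch 200."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for epoch in range(total_epochs):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(epoch)
    # ... run one training epoch here, with a DataLoader batch size of 256 ...
```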