A Touch, Vision, and Language Dataset for Multimodal Alignment
Authors: Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) tactile-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code, checkpoints and data are available on https://tactile-vlm.github.io. |
| Researcher Affiliation | Collaboration | UC Berkeley, Meta AI, TU Dresden. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. Methods are described in prose and with diagrams, but not in a pseudocode format. |
| Open Source Code | Yes | Code, checkpoints and data are available on https://tactile-vlm.github.io. |
| Open Datasets | Yes | In this work, we present the Touch-Vision-Language (TVL) dataset, a novel dataset consisting of 44K paired vision-tactile observations, where 10% of the data are annotated by humans while the rest are labeled by GPT-4V. ... Code, checkpoints and data are available on https://tactile-vlm.github.io. |
| Dataset Splits | No | The paper states, "We perform a 99%-1% train-test split across both dataset components...", but it does not give the size or derivation of a separate validation set needed for full reproducibility; a "validation set" is mentioned in a Table 3 footnote without further detail. A sketch of such a split appears after this table. |
| Hardware Specification | Yes | All experiments are run on a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions optimizers (AdamW) and models (OpenCLIP, LLaMA-2 7B) but does not provide specific version numbers for software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or specific library versions. |
| Experiment Setup | Yes | Config values: optimizer: AdamW (Loshchilov & Hutter, 2017b); base learning rate: 1.5e-4; learning rate schedule: cosine decay (Loshchilov & Hutter, 2017a); batch size: 256; weight decay: 0.05; optimizer momentum: β1, β2 = 0.9, 0.95 (Chen et al., 2020); warm-up epochs (Goyal et al., 2017): 10; total epochs: 200. See the training-setup sketch after this table. |
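
As a concrete illustration of the reported 99%-1% train-test split, here is a minimal Python sketch that partitions a flat list of paired sample indices. The function name, fixed seed, and shuffling strategy are illustrative assumptions, not the authors' released code.

```python
import random

def split_indices(num_samples: int, test_fraction: float = 0.01, seed: int = 0):
    """Randomly partition sample indices into train and test sets.

    A minimal sketch of a 99%-1% train-test split; seeding and shuffling
    details are assumptions, not taken from the paper's code release.
    """
    indices = list(range(num_samples))
    rng = random.Random(seed)
    rng.shuffle(indices)
    num_test = max(1, int(round(num_samples * test_fraction)))
    return indices[num_test:], indices[:num_test]  # (train, test)

# Example: ~44K paired vision-tactile observations -> roughly 43,560 train / 440 test.
train_idx, test_idx = split_indices(44_000)
```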
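
The optimization settings in the Experiment Setup row can be expressed as a short PyTorch sketch: AdamW with base learning rate 1.5e-4, betas (0.9, 0.95), weight decay 0.05, 10 warm-up epochs, and cosine decay over 200 total epochs. The warm-up/cosine schedule implementation and the `model` placeholder are assumptions for illustration; the paper reports only the hyperparameter values, not this exact code.

```python
import math
import torch

model = torch.nn.Linear(512, 512)  # placeholder for the encoder being trained

base_lr, weight_decay = 1.5e-4, 0.05
warmup_epochs, total_epochs, batch_size = 10, 200, 256

optimizer = torch.optim.AdamW(
    model.parameters(), lr=base_lr, betas=(0.9, 0.95), weight_decay=weight_decay
)

def lr_lambda(epoch: int) -> float:
    """Linear warm-up for the first 10 epochs, then cosine decay to zero at epoch 200."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(total_epochs):
    # ... one pass over the training data with batch_size=256 goes here ...
    optimizer.step()   # placeholder for the actual per-batch updates
    scheduler.step()   # advance the learning-rate schedule once per epoch
```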