Learning to Jointly Understand Visual and Tactile Signals

Authors: Yichen Li, Yilun Du, Chao Liu, Chao Liu, Francis Williams, Michael Foshey, Benjamin Eckart, Jan Kautz, Joshua B. Tenenbaum, Antonio Torralba, Wojciech Matusik

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We collect a multi-modal visual-tactile dataset that contains paired full-hand force pressure maps and manipulation videos. We also propose a novel method to learn a cross-modal latent manifold that allows for cross-modal prediction and discovery of latent structure in different data modalities. We conduct extensive experiments to demonstrate the effectiveness of our method.
Researcher Affiliation | Collaboration | Yichen Li1, Yilun Du1, Chao Liu1, Chao Liu2, Francis Williams2, Michael Foshey1, Ben Eckart2, Jan Kautz2, Joshua B. Tenenbaum1, Antonio Torralba1, Wojciech Matusik1 1MIT CSAIL 2NVIDIA
Pseudocode | No | The paper describes the method using mathematical formulations and textual explanations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a link to a supplementary website: "For further references: https://sites.google.com/view/iclr-submission-force-vision/home?authuser=3", but it does not explicitly state that the source code for the described methodology is available at this link or elsewhere.
Open Datasets | No | The paper describes a dataset collected by the authors: "To this end, we collect a cross-modal force-vision dataset. We collect the active full-hand force pressure sequence data using specialized tactile sensor gloves (Sundaram et al., 2019b)." However, it does not provide any public access information (link, DOI, or formal citation) for this dataset. The given link "https://sites.google.com/view/iclr-submission-force-vision/home?authuser=3" is for "further references" and does not explicitly offer a dataset download.
Dataset Splits | No | The paper describes training and test sets but does not explicitly mention a separate validation set. For example: "Our training set contains 81 objects spanning 4 different categories, with paired tactile and video recordings of manipulation, containing 123,561 frames of data. Our test set is constructed with 3 different subset..."
Hardware Specification | Yes | We use a workstation equipped with NVIDIA Tesla V100 GPUs and 64-core AMD CPUs.
Software Dependencies | Yes | All baseline methods and our proposed method are implemented using the open-source PyTorch (Paszke et al., 2019) package with CUDA 11.7 backend.
Experiment Setup | Yes | All our manifolds are randomly initialized with Gaussian distribution with (µ, σ) = (0, 1). Tactile manifold Mh is initialized to be 16 dimensional and all other manifolds are initialized to be 256 dimensional. We use MLPs to express our neural fields Φ. All MLP neural fields are of three layers and 512 hidden dimension... We use the Adam optimizer for training the baselines and our methods with learning rate initialized to 1e-3; batch size is set to 64.
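
The experiment-setup row above can be sketched in PyTorch. This is a minimal illustration of the quoted hyperparameters only (N(0, 1)-initialized latent codes, a 16-dimensional tactile manifold, 256-dimensional manifolds otherwise, 3-layer MLPs with 512 hidden units, Adam at lr 1e-3, batch size 64); the variable names, number of latent codes, and the field's output dimension are assumptions, not the authors' released implementation.

# Hypothetical sketch of the reported setup; names and shapes beyond the
# quoted hyperparameters are assumptions.
import torch
import torch.nn as nn

NUM_SEQUENCES = 1000               # assumption: one latent code per training sequence
TACTILE_DIM, OTHER_DIM = 16, 256   # manifold dimensionalities quoted in the paper

# Latent manifold codes, randomly initialized from a standard Gaussian N(0, 1)
tactile_codes = nn.Parameter(torch.randn(NUM_SEQUENCES, TACTILE_DIM))
visual_codes = nn.Parameter(torch.randn(NUM_SEQUENCES, OTHER_DIM))

def make_neural_field(in_dim, out_dim, hidden=512):
    """Three-layer MLP with 512 hidden units, as described for the fields Phi."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

# Output dimension here is an assumption made for illustration.
phi_tactile = make_neural_field(TACTILE_DIM, OTHER_DIM)

# Adam with learning rate 1e-3 over latent codes and field parameters
optimizer = torch.optim.Adam(
    [tactile_codes, visual_codes, *phi_tactile.parameters()], lr=1e-3
)
batch_size = 64  # as stated in the paper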