Adapting Pretrained ViTs with Convolution Injector for Visuo-Motor Control

Authors: Dongyoon Hwang, Byungkun Lee, Hojoon Lee, Hyunseung Kim, Jaegul Choo

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate CoIn with three distinct types of pretrained ViTs (CLIP, MVP, VC-1) across 12 varied control tasks within three separate domains (Adroit, Meta-World, DMC), and demonstrate that CoIn consistently enhances control task performance across all experimented environments and models, validating the effectiveness of providing pretrained ViTs with control-centric biases.
Researcher Affiliation | Academia | Kim Jaechul Graduate School of AI, KAIST. Correspondence to: Dongyoon Hwang <godnpeter@kaist.ac.kr>.
Pseudocode | No | The paper describes the architecture and processes in prose and diagrams (e.g., Section 3, Figure 3) but does not include any explicitly labeled pseudocode or algorithm blocks. (A hedged illustrative sketch of a convolution-injection adapter follows the table.)
Open Source Code | Yes | Code: https://github.com/dojeon-ai/CoIn
Open Datasets | Yes | We consider a total of 12 tasks across three different domains: 2 tasks from Adroit (Rajeswaran et al., 2018), 5 tasks from Meta-World (Yu et al., 2020), and 5 tasks from DMC (Tassa et al., 2018). Following existing work (Hansen et al., 2022; Parisi et al., 2022; Majumdar et al., 2023; Nair et al., 2022), we utilize 100 expert demonstrations for Adroit and DMC, and 25 for Meta-World, across a training span of 100 epochs.
Dataset Splits | No | The paper states, "Following existing work (...), we utilize 100 expert demonstrations for Adroit and DMC, and 25 for Meta-World, across a training span of 100 epochs." and "The visuo-motor control policy's performance is evaluated every 5 epochs, with the best success rate achieved during training reported across three independent runs for each task." While it describes a training and evaluation process, it does not explicitly provide percentages or sample counts for training, validation, and test splits. (The quoted evaluation protocol is sketched after the table.)
Hardware Specification | Yes | Inference speed was calculated on a single RTX-3090 GPU using a single input image with a resolution of 224 x 224. (A typical way to take such a measurement is sketched after the table.)
Software Dependencies | No | The paper mentions optimizers like "AdamW" and "Adam" and uses "ptflops" for computation costs, but it does not specify software library versions (e.g., "PyTorch 1.9", "TensorFlow 2.x") or specific versions for tools like ptflops. (A typical ptflops invocation is sketched after the table.)
Experiment Setup | Yes | Detailed hyperparameters for finetuning pretrained visual encoders with and without CoIn are listed in Table 8, and detailed hyperparameters for finetuning the control policy network are listed in Table 9.
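
Since the paper contains no pseudocode, the sketch below is only a loudly-hedged illustration of what a convolution-injection adapter for a pretrained ViT commonly looks like: patch tokens are reshaped back to a 2D grid, a lightweight depthwise convolution injects locality bias, and the result is added residually to the token stream. The module name, kernel size, zero initialization, and placement are all assumptions for illustration, not the paper's verified CoIn design.

```python
import torch
import torch.nn as nn

class ConvInjector(nn.Module):
    """Hypothetical convolution-injection adapter (NOT the paper's verified design).

    Reshapes ViT patch tokens to a 2D grid, applies a depthwise convolution
    to inject a locality bias, and adds the result back to the token stream.
    """

    def __init__(self, dim: int, grid_size: int = 14, kernel_size: int = 3):
        super().__init__()
        self.grid_size = grid_size  # 14 = 224 / 16 for a patch-16 ViT
        self.dwconv = nn.Conv2d(
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim
        )
        # Zero init makes the adapter an identity at the start of finetuning,
        # so pretrained ViT features are initially preserved (an assumption).
        nn.init.zeros_(self.dwconv.weight)
        nn.init.zeros_(self.dwconv.bias)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, 1 + H*W, C) with a leading [CLS] token, as in a standard ViT
        cls_tok, patch_tok = tokens[:, :1], tokens[:, 1:]
        B, N, C = patch_tok.shape
        H = W = self.grid_size
        grid = patch_tok.transpose(1, 2).reshape(B, C, H, W)
        patch_tok = patch_tok + self.dwconv(grid).flatten(2).transpose(1, 2)
        return torch.cat([cls_tok, patch_tok], dim=1)
```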
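The evaluation protocol quoted in the Dataset Splits row (evaluate every 5 epochs, take the best success rate per run, report across three independent runs) reduces to simple bookkeeping. A minimal sketch, where `train_one_epoch`, `evaluate_success_rate`, and `run_factory` are hypothetical helpers:

```python
import numpy as np

EVAL_INTERVAL = 5   # epochs between evaluations, as stated in the paper
NUM_EPOCHS = 100    # training span, as stated in the paper
NUM_SEEDS = 3       # independent runs per task

def best_success_rate(train_one_epoch, evaluate_success_rate):
    """Track the best success rate achieved during one training run."""
    best = 0.0
    for epoch in range(1, NUM_EPOCHS + 1):
        train_one_epoch()
        if epoch % EVAL_INTERVAL == 0:
            best = max(best, evaluate_success_rate())
    return best

def reported_score(run_factory):
    """Per-run bests aggregated across seeds; run_factory(seed) is hypothetical."""
    bests = [best_success_rate(*run_factory(seed)) for seed in range(NUM_SEEDS)]
    return float(np.mean(bests))
```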
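The Hardware Specification row quotes a single-image latency measurement on an RTX-3090 at 224 x 224. A minimal sketch of how such a measurement is commonly taken in PyTorch; the timm model name and the warmup/iteration counts are illustrative assumptions, not the paper's setup:

```python
import time
import torch
import timm  # model choice is an assumption; the paper uses CLIP, MVP, and VC-1 encoders

model = timm.create_model("vit_base_patch16_224", pretrained=False).cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")  # single image at the quoted resolution

with torch.no_grad():
    for _ in range(20):          # warmup so CUDA kernels are compiled and cached
        model(x)
    torch.cuda.synchronize()     # drain queued GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()     # wait for the timed work to actually finish
    print(f"{(time.perf_counter() - start) / 100 * 1e3:.2f} ms / image")
```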
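The Software Dependencies row notes that ptflops is used without a stated version. For reference, the usual invocation looks like the sketch below; the timm ViT stands in for whichever encoder the authors actually measured:

```python
import timm
from ptflops import get_model_complexity_info

model = timm.create_model("vit_base_patch16_224", pretrained=False)

# Multiply-accumulate count and parameter count for one forward pass
# at the paper's quoted 224 x 224 input resolution.
macs, params = get_model_complexity_info(
    model, (3, 224, 224), as_strings=True, print_per_layer_stat=False
)
print(f"MACs: {macs}, Params: {params}")
```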