Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

DisPIM: Distilling PreTrained Image Models for Generalizable Visuo-Motor Control

Authors: Haitao Wang, Hejun Wu

IJCAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We thoroughly evaluate the performance of our DisPIM framework on three widely used generalization benchmarks: DMC-GB [Hansen and Wang, 2020], DrawerWorld [Wang et al., 2021], and PixMC [Xiao et al., 2022]. We also perform real-world experiments on an Aubo i5 robot. In all evaluations, our method demonstrates superior performance, showcasing its effectiveness and versatility. We perform online training in the simulator and deploy the model directly on the Aubo i5 robot to conduct experiments on the Lift Duck and Pick&Place tasks (shown in Figure 1). The experimental results are presented in Figure 7.
Researcher Affiliation | Academia | Haitao Wang 1,2, Hejun Wu 1,2; 1 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; 2 Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, Guangdong, China; EMAIL
Pseudocode | No | The paper describes the methodology and framework in detail using text and figures (e.g., Figure 3: Overview of our DisPIM framework), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide a link to a code repository. The only link provided (https://www.aubo-robotics.cn/) is for the robot used in experiments, not for the research code.
Open Datasets | Yes | We thoroughly evaluate the performance of our DisPIM framework on three widely used generalization benchmarks: DMC-GB [Hansen and Wang, 2020], DrawerWorld [Wang et al., 2021], and PixMC [Xiao et al., 2022]. For the teacher encoder, we use the ViT-Small encoder [Dosovitskiy et al., 2020] with a 16×16 patch size, a hidden size of 384, 6 attention heads, and 12 blocks. We use the MAE framework [He et al., 2022] to pretrain the teacher model on the ImageNet dataset [Deng et al., 2009]. In addition to ImageNet [Deng et al., 2009], we also utilize other widely recognized datasets to pretrain the teacher model: CLIP [Radford et al., 2021] (contrastive language-image pretraining data) and Ego4D [Grauman et al., 2021] (daily-life activity videos).
Dataset Splits | Yes | We thoroughly evaluate the performance of our DisPIM framework on three widely used generalization benchmarks: DMC-GB [Hansen and Wang, 2020], DrawerWorld [Wang et al., 2021], and PixMC [Xiao et al., 2022]. We evaluate robustness to visual background changes on DMC-GB. Models are trained in the original DMControl environment [Tassa et al., 2018], and we measure generalization to environments with natural videos as the background. We measure generalization on surfaces of different textures, unlike the grid texture used for training. We introduce a distractor at test time that varies from the training object in color, shape, and size.
Hardware Specification | No | The paper mentions using an "Aubo i5 robot" for real-world experiments, but it does not specify any hardware details (such as GPU or CPU models, or memory) for the computational resources used to run the simulations or train the models. The Aubo i5 is the robotic platform, not the computational hardware for training.
Software Dependencies | No | The paper mentions specific reinforcement learning algorithms such as PPO [Schulman et al., 2017] and DrQ-v2 [Yarats et al., 2021], and models such as the ViT-Small encoder [Dosovitskiy et al., 2020] and the MAE framework [He et al., 2022]. However, it does not provide specific version numbers for any programming languages, libraries, or frameworks used in their implementation.
Experiment Setup | Yes | For the teacher encoder, we use the ViT-Small encoder [Dosovitskiy et al., 2020] with a 16×16 patch size, a hidden size of 384, 6 attention heads, and 12 blocks. We use the MAE framework [He et al., 2022] to pretrain the teacher model on the ImageNet dataset [Deng et al., 2009]. For the student encoder, we use a 6-layer transformer encoder, with the other parameters the same as those of the teacher encoder. All experimental results are the average of five random seeds. The Q-dynamic weight is defined as W_q(s, a) = 2σ(Q_std(s, a)). For UCB exploration, the action is chosen by a_t = argmax_a { Q_mean(s_t, a) + λ·Q_std(s_t, a) }.
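
The UCB exploration rule quoted above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `ucb_action`, the use of an ensemble of Q-estimates to obtain Q_mean and Q_std, and the array shapes are all assumptions, since the excerpt gives only the selection formula.

```python
import numpy as np

def ucb_action(q_values: np.ndarray, lam: float = 1.0) -> int:
    """Select the action maximizing Q_mean(s_t, a) + lambda * Q_std(s_t, a).

    q_values: hypothetical array of shape (ensemble_size, num_actions)
    holding several Q-value estimates for the current state s_t; how the
    ensemble is built is not specified in the quoted excerpt.
    """
    q_mean = q_values.mean(axis=0)  # Q_mean(s_t, a) over the ensemble
    q_std = q_values.std(axis=0)    # Q_std(s_t, a) over the ensemble
    return int(np.argmax(q_mean + lam * q_std))

# Two ensemble members, two actions: both actions have mean Q = 1.0,
# but action 1 has higher disagreement (std = 1.0 vs. 0.0).
q = np.array([[1.0, 0.0],
              [1.0, 2.0]])
greedy = ucb_action(q, lam=0.0)       # ties broken toward action 0
exploratory = ucb_action(q, lam=1.0)  # uncertainty bonus favors action 1
```

With lam = 0 the rule degenerates to greedy selection on the mean; a larger lam biases the agent toward actions whose value estimates disagree, which is the exploration incentive the excerpt describes.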