Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training

Authors: Shuo Cheng, Liqian Ma, Zhenyang Chen, Ajay Mandlekar, Caelan Garrett, Danfei Xu

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We aim to validate the following core hypotheses. H1: Our method effectively learns complex manipulation tasks in both simulation and the real world. H2: Our method generalizes to target domains only seen in simulation. H3: Our method is broadly applicable to multiple observation modalities. H4: Scaling up simulation data coverage improves generalization performance. 5.1 Experiment Setups To evaluate the effectiveness of our approach, we conduct comprehensive experiments in both sim-to-sim and sim-to-real transfer scenarios on a suite of robotic tabletop manipulation tasks: Lift, Box In Bin, Stack, Square, Mug Hang, and Drawer. These tasks are designed to test the system's ability to handle key challenges in robotic manipulation, including dense object interactions, long-horizon reasoning, and high-precision control.
Researcher Affiliation Collaboration Shuo Cheng¹*, Liqian Ma¹*, Zhenyang Chen¹, Ajay Mandlekar², Caelan Garrett², Danfei Xu¹. ¹Georgia Institute of Technology, ²NVIDIA Corporation
Pseudocode Yes Algorithm 1: Joint Policy Training with OT
Require: Source dataset $D_{src}$, target dataset $D_{tgt}$
1: Initialize encoder $f_\phi$ and policy $\pi_\theta$
2: Compute DTW distances for all trajectory pairs in $D_{src}$ and $D_{tgt}$
3: for iteration t = 1 to T do
4:   Sample a paired batch $\{(o^i_{src}, x^i_{src}, a^i_{src}, o^j_{tgt}, x^j_{tgt}, a^j_{tgt})\}$ of size N from $D_{src}$ and $D_{tgt}$ using the strategy described in Sec. 4.3
5:   Compute features $\{z^i_{src}\}$ and $\{z^j_{tgt}\}$ using encoder $f_\phi$
6:   Construct ground cost matrix $\hat{C}_\phi$ as described in Sec. 4.1
7:   Compute optimal transport plan $\Pi^* = \arg\min_{\Pi \in \mathbb{R}_+^{N \times N}} \left( \langle \Pi, \hat{C}_\phi \rangle_F + \epsilon\,\Omega(\Pi) + \tau\,\mathrm{KL}(\Pi \mathbf{1} \,\|\, p) + \tau\,\mathrm{KL}(\Pi^\top \mathbf{1} \,\|\, q) \right)$ via the Sinkhorn-Knopp algorithm [49]
8:   Compute OT loss $L_{UOT}(f_\phi) = \langle \Pi^*, \hat{C}_\phi \rangle_F$
9:   Sample $\{(o^i_{src}, x^i_{src}, a^i_{src})\}$ from $D_{src}$ and sample $\{(o^j_{tgt}, x^j_{tgt}, a^j_{tgt})\}$ from $D_{tgt}$
10:   Compute BC loss $L_{BC}(f_\phi, \pi_\theta)$
11:   Update $f_\phi$ and $\pi_\theta$ with gradients of $L_{BC}(f_\phi, \pi_\theta) + \lambda L_{UOT}(f_\phi)$
12: end for
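To make steps 7 and 8 of the algorithm more concrete, the following is a minimal pure-Python sketch of entropic unbalanced optimal transport solved with Sinkhorn-style scaling iterations. It is not the authors' implementation. It assumes that Ω is the usual entropic regularizer, so that the KL-relaxed marginals lead to the standard scaling updates with exponent τ / (τ + ε). The function names `unbalanced_sinkhorn` and `ot_loss`, the toy cost matrix, and the demo values of `eps` and `tau` are all hypothetical; the cost matrix is taken as given because Sec. 4.1 is not reproduced here.

```python
import math


def unbalanced_sinkhorn(cost, p, q, eps=0.1, tau=1.0, n_iters=200):
    """Approximate the unbalanced OT plan for a cost matrix (list of lists).

    Assumption: entropic regularization with weight eps and KL marginal
    penalties with weight tau, solved by scaling iterations.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel K = exp(-C / eps).
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    power = tau / (tau + eps)  # softens the marginal constraints
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        for i in range(n):
            kv = sum(kernel[i][j] * v[j] for j in range(m))
            u[i] = (p[i] / kv) ** power
        for j in range(m):
            ktu = sum(kernel[i][j] * u[i] for i in range(n))
            v[j] = (q[j] / ktu) ** power
    # Plan = diag(u) K diag(v).
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_loss(plan, cost):
    """Frobenius inner product <plan, cost>, as in step 8."""
    return sum(
        plan[i][j] * cost[i][j]
        for i in range(len(cost))
        for j in range(len(cost[0]))
    )


# Toy example: two source and two target features, matched pairs are cheap.
toy_cost = [[0.0, 1.0], [1.0, 0.0]]
uniform = [0.5, 0.5]
toy_plan = unbalanced_sinkhorn(toy_cost, uniform, uniform)
toy_loss = ot_loss(toy_plan, toy_cost)
```

In this toy case the plan places almost all of its mass on the diagonal, where the cost is zero. With a very small regularization weight such as the reported ε = 0.0005, the kernel `exp(-C / eps)` can underflow in this naive form, so a log-domain variant would likely be needed in practice.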
Open Source Code Yes Project webpage: https://ot-sim2real.github.io/. (page 1) Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code and data are available on our project webpage.
Open Datasets Yes Simulation data. For simulation experiments, we begin by collecting 10 human demonstrations per task. Using MimicGen [10], we synthesize 200-1000 trajectories in the source domain, covering the full range of initial states (denoted as Source). In the target domain, we divide the reset region into two subregions: one is populated with 10 trajectories for training (denoted as Target), while the other remains completely held out from training (denoted as Target-OOD). Real data. For real-world experiments, we adopt a similar strategy by partitioning the reset region aligned with the simulation setup into two subregions. Based on task complexity, we collect 10-25 human demonstrations within one subregion and generate 1000 simulated trajectories. Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code and data are available on our project webpage.
Dataset Splits Yes Simulation data. For simulation experiments, we begin by collecting 10 human demonstrations per task. Using MimicGen [10], we synthesize 200-1000 trajectories in the source domain, covering the full range of initial states (denoted as Source). In the target domain, we divide the reset region into two subregions: one is populated with 10 trajectories for training (denoted as Target), while the other remains completely held out from training (denoted as Target-OOD). This held-out subregion is used to evaluate each method's generalization under Out-Of-Distribution (OOD) conditions. Source: A large region that is densely covered by demonstrations in the source domain. We generate 1000 demonstrations using MimicGen [10] within the Source region. Target: A small subset of the Source region. This region is sparsely covered by demonstrations in the target domain, and is therefore considered in-distribution during evaluation. For sim-to-sim transfer, we collect 10 demonstrations within this region. For sim-to-real transfer, the number of real-world demonstrations collected in the Target region is adjusted based on task difficulty, as detailed in Tab. 5. Target-OOD: No demonstrations are collected in the Target-OOD region, which is used solely for evaluation and treated as out-of-distribution (OOD).
Hardware Specification No The system setup is illustrated in Fig. 5. We use a Franka Emika Panda robot controlled via a joint impedance controller [56] running at 20 Hz for policy execution. For data collection, the robot is teleoperated using a Meta Quest 3 headset, with tracked Cartesian poses converted to joint configurations through inverse kinematics. RGB image and depth are captured using an Intel RealSense D435 depth camera. (This text describes the robot, camera, and headset used for data collection and deployment, not the specific computational hardware, such as GPUs or CPUs, used for training the models.)
Software Dependencies No For point cloud-based experiments, we adopt the 3D Diffusion Policy architecture [28] with a PointNet encoder [52]. For experiments with image-based policy, we adopt Diffusion Policy [27] with a ResNet-based [54] visual encoder. (This text mentions software architectures but does not provide specific version numbers for these components.)
Experiment Setup Yes Our overall training procedure is summarized in Algm. 1. We use a batch size of 256 for the behavior cloning loss $L_{BC}$, with a co-training ratio of 0.9 following Maddukuri et al. [22]. For the optimal transport loss $L_{OT}$, the batch size is set to 128, with a weighting coefficient λ = 0.1. We use ϵ = 0.0005 and τ = 0.01 in our experiments.
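For readers who want to reuse the reported settings, the values quoted above can be gathered into a small configuration object, as in the sketch below. The class name `TrainConfig`, its field names, and the helper `total_loss` are hypothetical and are not taken from the authors' code; only the numeric values come from the quoted setup. The exact meaning of the co-training ratio is defined in Maddukuri et al. [22] and is stored here without interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the quoted experiment setup."""

    bc_batch_size: int = 256      # batch size for the behavior cloning loss
    cotrain_ratio: float = 0.9    # co-training ratio, following [22]
    ot_batch_size: int = 128      # batch size for the optimal transport loss
    ot_weight: float = 0.1        # lambda, weight on the OT loss
    sinkhorn_eps: float = 0.0005  # epsilon, regularization weight
    kl_tau: float = 0.01          # tau, weight on the KL marginal penalties


def total_loss(bc_loss: float, ot_loss: float, cfg: TrainConfig) -> float:
    """Combine the two losses as BC + lambda * OT."""
    return bc_loss + cfg.ot_weight * ot_loss


cfg = TrainConfig()
example_total = total_loss(1.0, 2.0, cfg)
```

Freezing the dataclass keeps the reported values from being changed by accident during a run; a modified experiment would build a new `TrainConfig` with explicit overrides.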