Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
Scaffolding Dexterous Manipulation with Vision-Language Models
Authors: Vincent de Bakker, Joey Hejna, Tyler Lum, Onur Celik, Aleksandar Taranovic, Denis Blessing, Gerhard Neumann, Jeannette Bohg, Dorsa Sadigh
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a comprehensive suite of experiments to assess the effectiveness, generality, and robustness of our method across a diverse range of dexterous manipulation tasks. |
| Researcher Affiliation | Academia | ¹Stanford University, ²Karlsruhe Institute of Technology |
| Pseudocode | No | The method is described procedurally in Section 3.2 'Trajectory Generation for High-Level Policies via VLMs' and Section 3.3 'Low-Level Control with Reinforcement Learning', and summarized in Section 3.4 'The Full Pipeline', but no formal pseudocode or algorithm block is presented. |
| Open Source Code | Yes | We will provide code for simulated experiments in the supplementary materials, though instructions may be lacking due to time limitations. |
| Open Datasets | Yes | We construct an evaluation suite using the ManiSkill simulator [45, 62] and Allegro Hand model... We situate our tasks in simulated scenes from the ReplicaCAD dataset [61]. |
| Dataset Splits | No | For each of N initial conditions from the environment, we sample corresponding high-level plans from π_h. We then train the low-level policy using PPO [58] by randomly sampling from the set of N initial conditions and plans across massively parallelized simulation environments. In simulation, we track keypoints using ground-truth object information to generate low-level observations o_l. Training across randomized plans is crucial for π_l to be robust to both the keypoints and plans generated by π_h. Further training details and hyperparameters are in Appendix A. Evaluation. At test time (Fig. 2(b)), we randomize the initial conditions of the environment. |
| Hardware Specification | Yes | Our training is performed on NVIDIA GPUs, ranging from A5000s to L40s. Depending on the specific task and hardware configuration, training durations vary between 1.5 and 6 hours. For real-world inference, we utilize two RTX 4090 GPUs. |
| Software Dependencies | No | The paper refers to `ManiSkill3 [62]` for simulations and `PPO [58]` for reinforcement learning, and `Gemini 2.5 Flash Thinking [63]` as the high-level policy, but does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The hyperparameters of our PPO training are detailed in Table 1. Table 1: PPO Hyperparameters. Normalize Advantage per Mini-Batch: True; Value Loss Coefficient: 1.0; Clip Parameter: 0.2; Use Clipped Value Loss: True; Desired KL: 0.01; Entropy Coefficient: 0.01; Discount Factor (Gamma): 0.99; GAE Lambda (Lam): 0.95; Max Gradient Norm: 1.0; Learning Rate: 0.0003; Number of Learning Epochs: 5; Number of Mini-Batches: 16; Schedule: Adaptive; Policy Class Name: Actor Critic; Activation Function: ELU; Actor Hidden Dimensions: [512, 512, 512]; Critic Hidden Dimensions: [512, 512, 512]; Initial Noise Std: 1.0; Noise Std Type: Scalar; Number of Steps per Environment: 24; Max Iterations: 2000; Empirical Normalization: True; Number of Environments: 2048. |
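
For readers trying to reproduce the setup, the PPO values quoted in the Experiment Setup row can be collected into a small configuration object, from which the implied rollout and mini-batch sizes follow by arithmetic. The sketch below is illustrative only: the class name `PPOConfig`, its field names, and the helper functions are hypothetical and are not taken from the authors' code. It also assumes that one iteration collects one rollout from every environment and then runs all learning epochs over it, which the quoted text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    """Hypothetical container for a subset of the Table 1 values (names are assumptions)."""
    value_loss_coef: float = 1.0
    clip_param: float = 0.2
    desired_kl: float = 0.01
    entropy_coef: float = 0.01
    gamma: float = 0.99
    gae_lambda: float = 0.95
    max_grad_norm: float = 1.0
    learning_rate: float = 3e-4
    num_learning_epochs: int = 5
    num_mini_batches: int = 16
    num_steps_per_env: int = 24
    max_iterations: int = 2000
    num_envs: int = 2048


def rollout_size(cfg: PPOConfig) -> int:
    # Transitions gathered per iteration: steps per environment times environments.
    return cfg.num_steps_per_env * cfg.num_envs


def mini_batch_size(cfg: PPOConfig) -> int:
    # Each rollout is split evenly into num_mini_batches chunks.
    return rollout_size(cfg) // cfg.num_mini_batches


def total_env_steps(cfg: PPOConfig) -> int:
    # Upper bound on simulated transitions if training runs for all iterations.
    return rollout_size(cfg) * cfg.max_iterations


def total_gradient_updates(cfg: PPOConfig) -> int:
    # Assumes every iteration runs all epochs over all mini-batches.
    return cfg.max_iterations * cfg.num_learning_epochs * cfg.num_mini_batches


if __name__ == "__main__":
    cfg = PPOConfig()
    print("rollout size:", rollout_size(cfg))
    print("mini-batch size:", mini_batch_size(cfg))
    print("total env steps:", total_env_steps(cfg))
    print("gradient updates:", total_gradient_updates(cfg))
```

Under these assumptions, each iteration gathers 24 × 2048 = 49,152 transitions, split into mini-batches of 3,072, for at most roughly 98.3 million simulated transitions over 2,000 iterations. These figures are derived from the table, not reported by the paper.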