Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills

Authors: Chunru Lin, Haotian Yuan, Yian Wang, Xiaowen Qiu, Tsun-Hsuan Johnson Wang, Minghao Guo, Bohan Wang, Yashraj Narang, Dieter Fox, Chuang Gan

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our approach across a wide range of manipulation tasks involving rigid, deformable, and fluid objects. Experiments show that our method consistently outperforms strong baselines in terms of both task success rate and overall performance. Notably, our approach achieves a 50.0% average success rate, significantly surpassing other baselines such as 3D generation (21.4%) and tool retrieval (11.1%).
Researcher Affiliation	Collaboration	1University of Massachusetts Amherst 2Massachusetts Institute of Technology 3National University of Singapore 4NVIDIA 5MIT-IBM Watson AI Lab
Pseudocode	No	The paper describes its method components and APIs (e.g., grasp, move, release) and provides an example of a programmatic tool representation and assembly function in Appendix A, which is a Python code snippet rather than formal pseudocode for the overall algorithm or methodology.
Open Source Code	No	Our full pipeline, including code and APIs, will be made publicly available to support further research.
Open Datasets	No	We curated 9 robotic manipulation tasks inspired by everyday human activities. These tasks vary widely in both physical properties and functional requirements, involving objects made of rigid, liquid, and soft bodies. This diversity ensures that our tool design module accommodates a range of physical interactions, while also allowing our pipeline to be rigorously evaluated across realistic and varied challenges encountered in daily life. The task details can be found in Appendix C.
Dataset Splits	No	The paper describes its experimental setup based on curated tasks and runs each experiment 8 times, reporting best score and success rate. It does not mention any training/validation/test dataset splits as typically used in machine learning model training.
Hardware Specification	Yes	We test our pipeline using the Genesis [4] simulator and conduct experiments on both NVIDIA GeForce RTX 4090 GPUs and 2080 Ti GPUs. ... Our tasks do not require large GPU memory, and many experiments were successfully conducted on a single NVIDIA 2080 Ti GPU, and others were conducted on NVIDIA GeForce RTX 4090 GPUs.
Software Dependencies	No	The paper mentions using API calls to models like O3-Mini and Meshy and imports 'trimesh' in an example code snippet in the appendix, but it does not specify version numbers for these or other software dependencies.
Experiment Setup	Yes	We support geometry optimization by allowing each shape parameter $s_i \in s$ to vary within a range of $[0.5\, s_i^{(0)},\ 2.0\, s_i^{(0)}]$, where $s_i^{(0)}$ is the initial value. The trajectory parameters $q$ include translational and rotational waypoints in Cartesian space. Translational components are constrained to lie within 0.2 m of their initial values, while rotational components may vary within $\pi$ radians. We apply the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [21] to optimize the joint parameter $(s, q)$. CMA-ES searches for improved solutions by iteratively sampling a population of candidates $\{(s_i, q_i)\}_{i=1}^{\lambda}$, where $\lambda = 20$ is the population size. Each optimization runs for 50 iterations.
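
The search-space constraints quoted in the Experiment Setup row can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names (build_bounds, clip_to_bounds, evaluations_per_run) and the flattened parameter layout are hypothetical, shape parameters are assumed to be positive, and "within 0.2 m" and "within π radians" are interpreted here as per-component box limits around the initial values, which the quoted text does not confirm. The resulting lower and upper bounds are the kind of input a bounded CMA-ES implementation would typically take.

```python
import math

POP_SIZE = 20              # lambda, the population size quoted in the table
N_ITERATIONS = 50          # iterations per optimization run quoted in the table
TRANS_RADIUS_M = 0.2       # translational limit around the initial value (metres)
ROT_RADIUS_RAD = math.pi   # rotational limit around the initial value (radians)


def build_bounds(shape0, trans0, rot0):
    """Return (lower, upper) lists for the flattened vector [s, q_trans, q_rot].

    Assumption: shape parameters are positive and every limit is applied
    independently per component (a box constraint).
    """
    lower, upper = [], []
    for s in shape0:
        if s <= 0:
            raise ValueError("shape parameters are assumed positive in this sketch")
        lower.append(0.5 * s)   # 0.5 * s_i^(0)
        upper.append(2.0 * s)   # 2.0 * s_i^(0)
    for t in trans0:
        lower.append(t - TRANS_RADIUS_M)
        upper.append(t + TRANS_RADIUS_M)
    for r in rot0:
        lower.append(r - ROT_RADIUS_RAD)
        upper.append(r + ROT_RADIUS_RAD)
    return lower, upper


def clip_to_bounds(candidate, lower, upper):
    """Project a sampled candidate back into the box [lower, upper]."""
    return [min(max(x, lo), hi) for x, lo, hi in zip(candidate, lower, upper)]


def evaluations_per_run():
    """Candidates evaluated in one run: population size times iterations."""
    return POP_SIZE * N_ITERATIONS


if __name__ == "__main__":
    # Hypothetical initial values: two shape parameters, one translation, one rotation.
    lower, upper = build_bounds([1.0, 0.4], [0.1], [0.0])
    print(lower, upper)
    print(evaluations_per_run())
```

Under the quoted settings, a single optimization run evaluates 20 × 50 = 1000 candidates; the table also notes that each experiment is run 8 times.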