Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

PointMapPolicy: Structured Point Cloud Processing for Multi-Modal Imitation Learning

Authors: Xiaogang Jia, Qian Wang, Anrui Wang, Han Wang, Balázs Gyenes, Emiliyan Gospodinov, Xinkai Jiang, Ge Li, Hongyi Zhou, Weiran Liao, Xi Huang, Maximilian Beck, Moritz Reuss, Rudolf Lioutikov, Gerhard Neumann

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive experiments on the RoboCasa, CALVIN benchmarks and real robot evaluations, we demonstrate that our method achieves state-of-the-art performance across diverse manipulation tasks. We conduct experiments on two simulation benchmarks RoboCasa [20] and CALVIN [21].
Researcher Affiliation	Collaboration	1 Karlsruhe Institute of Technology; 2 Reality Labs, Meta; 3 Johannes Kepler University Linz
Pseudocode	No	The paper describes methods through textual explanations and visual diagrams (Figure 2, Figure 3) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	We will open source the codes in the near future once they are cleaned up and anonymity is not a concern anymore.
Open Datasets	Yes	We validate the effectiveness of point map observations on two challenging benchmarks: RoboCasa [20] and CALVIN [21]. All the experiments we conducted were using open-source datasets.
Dataset Splits	Yes	For RoboCasa, each model was trained for 50 epochs using three random seeds, with performance measured at the 30th, 40th, and 50th checkpoints, selecting the best result. For the CALVIN benchmark, models were trained for 25 epochs, with the best success rate reported from the 10th, 15th, 20th, and 25th checkpoints. Policies are evaluated on 1,000 such instruction chains per seed. We evaluate PointMapPolicy on one standard CALVIN setting: ABC→D, where the policy is trained on environments A, B, and C, and evaluated zero-shot on D. Only 1% of the play data is paired with language.
Hardware Specification	Yes	For the CALVIN experiments, PMP employs Film-ResNet50 as encoders for both images and point maps, with 8 x-Blocks as backbones (512 latent dimensions), totaling 147M trainable parameters. Training utilizes 4 Nvidia RTX 6000 Ada GPUs with 128 samples per GPU (512 total batch size). For the RoboCasa experiments, PMP-Cat employs ConvNeXtv2 as encoders with 8 x-Blocks using 512 latent dimensions. Training utilizes 1 NVIDIA A100-SXM4-40GB with a 128 batch size. For the real-robot experiments, PMP-Cat employs Film-ResNet50 as encoders for both images and point maps, with 6 x-Blocks using 256 latent dimensions. Training utilizes 1 Nvidia RTX 6000 Ada GPU with a 128 batch size.
Software Dependencies	No	The paper mentions software components and libraries like PyTorch (implied by Grad-CAM++ implementation link) and specific models like Film-ResNet50 and ConvNeXtv2, but it does not specify exact version numbers for these software dependencies required for replication.
Experiment Setup	Yes	Table 6: Summary of all the Hyperparameters for our experiments. This table details specific values for parameters such as Number of x-Blocks, Attention Heads, Action Chunk Size, History Length, Embedding Dimension, Image Encoder, Goal Lang Encoder, Attention Dropout, Residual Dropout, MLP Dropout, Optimizer, Betas, Learning Rate, Transformer Weight Decay, Other weight decay, Batch Size, Train Steps in Thousands, σmax, σmin, σt, EMA, Time steps, Sampler, and Trainable Parameters.