Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Generative Perception of Shape and Material from Differential Motion
Authors: Xinran Han, Ko Nishino, Todd Zickler
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train our model from scratch on short synthetic videos of moving and static objects, and we find that it generalizes to captured images and videos while exhibiting several desirable perceptual capabilities: (i) it achieves competitive results on existing static-image shape and material benchmarks; (ii) it exhibits an emergent ability to generate plausible multimodal samples when the input is ambiguous, such as the classic convex/concave ambiguity or the trivial postcard solution (see Figure 1); and crucially, (iii) it makes effective use of differential object motion to resolve perceptual ambiguities, improving prediction accuracy when additional motion frames are available. |
| Researcher Affiliation | Academia | Xinran Nicole Han, Harvard University, EMAIL; Ko Nishino, Kyoto University, EMAIL; Todd Zickler, Harvard University, EMAIL |
| Pseudocode | No | The paper describes the methodology using textual explanations and figures, such as Figure 2 depicting the U-ViT3D-Mixer architecture, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Yes we include those details in the supplementary materials. We plan to open source the code, pretrained model as well as our synthetically generated dataset. We will also provide our data generation script for customized usage. |
| Open Datasets | Yes | We evaluate shape accuracy using the DiLiGenT dataset [51]... We assess albedo/texture estimation using the MIT Intrinsic Image Dataset [22]... Figure 8 compares our model to RGB-X [61] and Diffusion Renderer [40] on real-world videos of various objects under natural outdoor or indoor lighting. We test them in two settings (a) moving objects, same as our training paradigm and (b) moving camera, which is out-of-distribution for our model. For the latter we use the captured images in the Stanford-ORB dataset [38], by feeding our model sets of three images from nearby camera views. |
| Dataset Splits | Yes | We generate 45 short video clips (5 frames each) for each object, and we split them into pairs of consecutive 3-frame clips (F = 3) for training. This results in a dataset of approximately 100K video-attribute pairs. We conduct ablation studies on a held-out synthetic evaluation set containing objects undergoing differential motion. |
| Hardware Specification | Yes | Training requires roughly five days using four H100 GPUs. During inference, we use DDIM sampling [52] with 50 steps, which takes about 2.7 seconds per input video on a single A100 GPU. |
| Software Dependencies | Yes | We therefore construct a synthetic dataset for training using the Mitsuba3 renderer [31] with custom integrators to extract ground-truth shape and material. We train our model using the AdamW optimizer [45]. |
| Experiment Setup | Yes | Complete hyperparameter settings for our architecture are provided in the appendix. Appendix A.9, Network Architecture Details: We use the following hyperparameters for the U-ViT3D-Mixer model: channels = [96, 192, 384, 768], block_dropout = [0, 0, 0.1, 0.1], block_type = [Local3D (1), Local3D (1), Transformer (3), Transformer (8)], noise_embedding_channels = 768, attention_num_heads = 6, patch_size = 2, local_attention_window_size = 7, channel_mixer_expansion_factor = 3, loss_type = v-prediction (MSE). We use the following training setup: batch_size = 64, optimizer = AdamW, adam_betas = (0.9, 0.99), adam_weight_decay = 0.01, learning_rate = 1e-4, mixed_precision = bfloat16, max_train_steps = 400k. |
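
The training setup and hardware figures quoted in the table can be combined into a few rough scale estimates. The sketch below is illustrative only and is not the authors' code: the names `TrainingSetup` and `derived_quantities` are hypothetical, and it assumes that each of the 400k training steps consumes exactly one batch of 64 samples (no gradient accumulation), that the dataset holds exactly 100K video-attribute pairs (the paper says "approximately"), and that training takes exactly five days on four GPUs (the paper says "roughly").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Training values as quoted in the Experiment Setup row (hypothetical container)."""
    batch_size: int = 64
    learning_rate: float = 1e-4
    adam_betas: tuple = (0.9, 0.99)
    adam_weight_decay: float = 0.01
    mixed_precision: str = "bfloat16"
    max_train_steps: int = 400_000


def derived_quantities(setup, dataset_size=100_000, train_days=5.0, num_gpus=4):
    """Back-of-envelope estimates; the defaults are rounded figures from the table."""
    # Assumption: one optimizer step processes one full batch.
    samples_seen = setup.batch_size * setup.max_train_steps
    return {
        "samples_seen": samples_seen,
        # Approximate number of passes over the training set.
        "approx_epochs": samples_seen / dataset_size,
        # Total accelerator time across all GPUs.
        "gpu_hours": train_days * 24 * num_gpus,
        # Average throughput in optimizer steps per wall-clock second.
        "steps_per_second": setup.max_train_steps / (train_days * 86_400),
    }


if __name__ == "__main__":
    for name, value in derived_quantities(TrainingSetup()).items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the model sees about 25.6 million training samples, which corresponds to roughly 256 passes over the dataset and 480 GPU-hours of training. These are order-of-magnitude estimates, not figures reported by the paper.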