Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
SIRE: SE(3) Intrinsic Rigidity Embeddings
Authors: Cameron Omid Smith, Basile Van Hoorick, Chonghyuk Song, Vincent Sitzmann, Vitor Campagnolo Guizilini, Yue Wang
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on broad tasks to validate our model's learned rigidity embeddings as well as intermediate geometry representations. Our experiments demonstrate that both dataset-wide and even per-video optimization of our rigidity embeddings produces strong semantic features useful for downstream segmentation tasks. Our findings suggest that this simple formulation can pave the way towards self-supervised learning of priors over geometry and object rigidities from large-scale video data. |
| Researcher Affiliation | Collaboration | Cameron Smith¹, Basile Van Hoorick², Chonghyuk Song³, Vincent Sitzmann³, Vitor Guizilini², Yue Wang¹ — ¹University of Southern California, ²Toyota Research Institute, ³Massachusetts Institute of Technology |
| Pseudocode | No | The paper includes figures (e.g., Figure 2) that illustrate the training forward pass, but it does not contain explicit pseudocode or algorithm blocks with structured, code-like formatting. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a direct link to a code repository. It mentions an OpenReview link for review purposes but not for code. |
| Open Datasets | Yes | We evaluate SIRE's intrinsic rigidity embeddings and intermediate 4D reconstruction quality on multiple downstream tasks, including frozen feature segmentation estimation, self-supervised depth estimation, and SE(3) trajectory estimation (though we emphasize that these geometry reconstructions are intermediate byproducts of our method and not the primary objective). Our results suggest that our learned intrinsic rigidity embeddings contain semantically intuitive segmentation-like features of moving objects in videos and that our method can be used to learn powerful priors on large-scale video data. For results on CO3D-Dogs, we pre-train our model on the entire CO3D-Dogs dataset, and for the DAVIS scenes, we fit our model from scratch, per-video. We evaluate on the DAVIS dataset (Caelles et al., 2019), which provides ground-truth moving object segmentations. We evaluate self-supervised depth estimation on the CO3D-Dogs (Sinha et al., 2023) dataset. |
| Dataset Splits | Yes | To train our model, we use the 1k videos from the CO3D-Dogs dataset. For evaluating on depth estimation (above), we evaluate on the first frame of 100 videos, and for SE(3) trajectory estimation (below), we similarly evaluate trajectories on the first 30 frames of 100 videos. |
| Hardware Specification | Yes | For full-scale training across multiple videos, we optimize on a single 48GB GPU, using a batch size of 120 frames (10-frame videos with batch size of 12). |
| Software Dependencies | No | We leverage CO-Tracker (Karaev et al., 2023) for our 2D point track estimations, ML-PRO (Bochkovskii et al., 2024) and Video Depth Anything (Chen et al., 2025) for any ground-truth depth (we use Video Depth Anything's estimates scaled to be metric via ML-PRO), GeoCalib (Veicht et al., 2024) for estimates of intrinsics from each video, and Adam for optimization. The paper mentions various tools and an optimizer (Adam), but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | When optimizing per scene (~40 frames of a video), convergence is achieved within minutes, with just a few GB of VRAM. For full-scale training across multiple videos, we optimize on a single 48GB GPU, using a batch size of 120 frames (10-frame videos with batch size of 12). The model is trained for 30k iterations (~0.5 days), though we observe that convergence is largely achieved within the first 5k iterations, demonstrating the compute- and data-efficiency of our method. We leverage CO-Tracker (Karaev et al., 2023) for our 2D point track estimations, ML-PRO (Bochkovskii et al., 2024) and Video Depth Anything (Chen et al., 2025) for any ground-truth depth (we use Video Depth Anything's estimates scaled to be metric via ML-PRO), GeoCalib (Veicht et al., 2024) for estimates of intrinsics from each video, and Adam for optimization. |
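The quoted setup (single 48GB GPU, 10-frame clips with a batch of 12 videos, 30k Adam iterations) can be summarized in a small configuration sketch. This is purely illustrative: the authors have not released code, so every field name here is hypothetical, and only the numeric values come from the paper's quoted text.

```python
# Hedged reconstruction of the reported training configuration.
# Field names are hypothetical; values are taken from the quoted setup.
from dataclasses import dataclass


@dataclass(frozen=True)
class SIRETrainConfig:
    frames_per_video: int = 10       # clip length used during full-scale training
    videos_per_batch: int = 12       # number of clips per optimization step
    total_iterations: int = 30_000   # reported full training length (~0.5 days)
    near_convergence_iters: int = 5_000  # convergence largely reached by here
    gpu_memory_gb: int = 48          # single-GPU budget
    optimizer: str = "adam"

    @property
    def frames_per_batch(self) -> int:
        # The paper's "batch size of 120 frames" = 10-frame videos x 12 videos.
        return self.frames_per_video * self.videos_per_batch


cfg = SIRETrainConfig()
print(cfg.frames_per_batch)  # 120
```

The per-scene variant described in the same cell (~40 frames, a few GB of VRAM, minutes to converge) would simply swap in a single longer clip rather than a multi-video batch.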