Object Scene Representation Transformer

Authors: Mehdi S. M. Sajjadi, Daniel Duckworth, Aravindh Mahendran, Sjoerd van Steenkiste, Filip Pavetic, Mario Lucic, Leonidas J. Guibas, Klaus Greff, Thomas Kipf

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To investigate OSRT's capabilities, we evaluate it on a range of datasets. After confirming that the proposed method outperforms existing methods on their comparably simple datasets, we move on to a more realistic, highly challenging dataset for all further investigations. We evaluate models by their novel view reconstruction quality and unsupervised scene decomposition capabilities qualitatively and quantitatively. We further investigate OSRT's computational requirements compared to the baselines and close this section with some further analysis into which ingredients are crucial to enable OSRT's unsupervised scene decomposition qualities in challenging settings.
Researcher Affiliation | Industry | Google Research
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] The new MSN-Hard dataset is published on the website.
Open Datasets | Yes | CLEVR-3D [33]. This is a recently proposed multi-camera variant of the CLEVR dataset, which is popular for evaluating object decomposition due to its simple structure and unambiguous objects.
Dataset Splits | No | The paper specifies training and test set sizes (e.g., '35k training and 100 test scenes') but does not explicitly mention a separate validation set or how data was split for validation.
Hardware Specification | Yes | OSRT renders novel views at 32.5 fps (frames per second), more than 3000× faster than ObSuRF, which only achieves 0.01 fps, both measured on an Nvidia V100 GPU.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup | No | The paper mentions the use of an L2 reconstruction loss, the number of input views (e.g., '5 input views', '3 input views'), and details about slot initialization ('7 randomly initialized slots', '11 slots using a learned initialization'), but it does not specify concrete hyperparameter values such as learning rate, batch size, or optimizer settings within the provided text.
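
The Experiment Setup row quotes two slot-initialization schemes (randomly initialized slots vs. a learned initialization) and an L2 reconstruction loss, without further hyperparameters. The sketch below is not the authors' code; it only illustrates how these two ingredients are commonly implemented. The module name `SlotInit`, the slot dimension of 128, and the use of PyTorch are assumptions made for illustration.

```python
# Hedged sketch (not the paper's implementation): the two slot-initialization
# schemes and the L2 reconstruction loss mentioned in the Experiment Setup row.
# Names, dimensions, and the choice of PyTorch are illustrative assumptions.
import torch
import torch.nn as nn


class SlotInit(nn.Module):
    """Produces initial slots either by sampling from a shared Gaussian
    ("random") or from learned per-slot embeddings ("learned")."""

    def __init__(self, num_slots: int, slot_dim: int, mode: str = "random"):
        super().__init__()
        self.num_slots, self.slot_dim, self.mode = num_slots, slot_dim, mode
        if mode == "random":
            # Shared mean / log-std of the Gaussian from which slots are drawn.
            self.mu = nn.Parameter(torch.zeros(1, 1, slot_dim))
            self.log_sigma = nn.Parameter(torch.zeros(1, 1, slot_dim))
        elif mode == "learned":
            # One learned embedding per slot.
            self.slots = nn.Parameter(torch.randn(1, num_slots, slot_dim) * 0.02)
        else:
            raise ValueError(f"unknown mode: {mode}")

    def forward(self, batch_size: int) -> torch.Tensor:
        if self.mode == "random":
            noise = torch.randn(batch_size, self.num_slots, self.slot_dim)
            return self.mu + self.log_sigma.exp() * noise
        return self.slots.expand(batch_size, -1, -1)


if __name__ == "__main__":
    # '7 randomly initialized slots' vs. '11 slots using a learned
    # initialization', as quoted in the row above.
    random_init = SlotInit(num_slots=7, slot_dim=128, mode="random")
    learned_init = SlotInit(num_slots=11, slot_dim=128, mode="learned")
    slots = random_init(batch_size=4)  # shape: (4, 7, 128)

    # L2 reconstruction loss between rendered and ground-truth pixels
    # (dummy tensors stand in for actual renderings).
    pred_rgb = torch.rand(4, 3, 64, 64)
    target_rgb = torch.rand(4, 3, 64, 64)
    loss = ((pred_rgb - target_rgb) ** 2).mean()
    print(slots.shape, loss.item())
```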