Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Transframer: Arbitrary Frame Prediction with Generative Models

Authors: Charlie Nash, Joao Carreira, Jacob C Walker, Iain Barr, Andrew Jaegle, Mateusz Malinowski, Peter Battaglia

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We present a general-purpose framework for image modelling and vision tasks based on probabilistic frame prediction. Our approach unifies a broad range of tasks, from image segmentation to novel view synthesis and video interpolation. We pair this framework with an architecture we term Transframer, which uses U-Net and Transformer components to condition on annotated context frames, and outputs sequences of sparse, compressed image features. Transframer is the state-of-the-art on a variety of video generation benchmarks, is competitive with the strongest models on few-shot view synthesis, and can generate coherent 30-second videos from a single image without any explicit geometric information. A single generalist Transframer simultaneously produces promising results on 8 tasks, including semantic segmentation, image classification and optical flow prediction, with no task-specific architectural components, demonstrating that multi-task computer vision can be tackled using probabilistic image models. Our approach can in principle be applied to a wide range of applications that require learning the conditional structure of annotated image-formatted data.
Researcher Affiliation Academia The paper states "Anonymous authors Paper under double-blind review". Therefore, no institutional affiliations are provided to classify the author types.
Pseudocode No The paper describes the architecture and methodology using text and diagrams (Figure 2a and 2b), but it does not include any structured pseudocode or algorithm blocks.
Open Source Code No The paper does not provide any explicit statements about code release, links to a code repository, or mention of code in supplementary materials for the described methodology. A link to example outputs is provided, but not to the source code.
Open Datasets Yes We first evaluate our model on BAIR (Ebert et al., 2017), which is one of the most well-studied video modelling datasets, consisting of short video clips of a single robot arm interacting in the unconditional setting. Kinetics600 (Carreira et al., 2018a) is an action recognition dataset, consisting of video clips of dynamic human actions across 600 activities, such as sailing, chopping, and dancing. The KITTI dataset (Geiger et al., 2012) contains long video clips of roads and surrounding areas taken from the viewpoint of a car driver. To evaluate our model in the action-conditional setting, we use RoboNet (Dasari et al., 2019), which consists of short video clips of robot arms interacting with objects, and provides robot action annotations as 5-dimensional vectors. We evaluate our model on ShapeNet benchmarks, in particular the chair and car subsets used by Yu et al. (2021). The dataset consists of renders of 3D objects from the ShapeNet database. As a more realistic view-synthesis task, we train and evaluate on the Objectron dataset. Objectron consists of short, object-centered clips with full object and camera pose annotations. For the classification task, we use the ImageNet dataset (Deng et al., 2009) and consider predicting one-hot images, where each class is encoded as an 8x8 white patch on a black background. On some tasks, such as Cityscapes, the model produces qualitatively good outputs.
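The quoted passage describes classification labels rendered as "one-hot images": an 8x8 white patch on a black background. A minimal sketch of one plausible such encoding is below; the grid layout (patch position derived from the class index) and all names are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def class_to_onehot_image(class_idx, image_size=256, patch_size=8):
    """Encode a class label as an image: an 8x8 white patch on a black
    background. Mapping the class index to a grid cell position is an
    assumed convention, not one specified in the paper."""
    cells_per_row = image_size // patch_size  # e.g. 32 cells per row at 256x256
    row, col = divmod(class_idx, cells_per_row)
    img = np.zeros((image_size, image_size), dtype=np.uint8)
    img[row * patch_size:(row + 1) * patch_size,
        col * patch_size:(col + 1) * patch_size] = 255
    return img
```

Under this assumed layout, class 0 lights the top-left cell and class 33 the second cell of the second row; the label image can then be predicted by the same frame-prediction model as any other target frame.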
Dataset Splits Yes For consistency with previous work we only report test-set FVD for Kinetics600 and BAIR. We follow the evaluation regime of Villegas et al. (2017), operating at 64x64 resolution, and generate 25 frames given 5 context frames, with test clips taken from 3 longer test clips at 5-frame strides. At 64x64 resolution, our model improves on alternatives in every metric by a large margin. At 128x128, we could not find comparable previous work, so we report our results in Appendix E for future comparisons. We train at 64x64 and 128x128 resolutions, and evaluate using 2 context frames and 10 sampled frames on the test set specified by FitVid (Babaeizadeh et al., 2021). As in Yu et al. (2021), we evaluate using either 1 or 2 context views, and predict the remaining views in a 251-frame test set.
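The evaluation regime quoted above slices longer test clips into fixed windows: a few context frames followed by a prediction horizon, advancing at a fixed stride. A small sketch of that slicing follows; the function name and the exact windowing convention are assumptions for illustration.

```python
def evaluation_windows(num_frames, context=5, horizon=25, stride=5):
    """Return (context_indices, target_indices) pairs covering a clip of
    num_frames frames: each window has `context` conditioning frames
    followed by `horizon` frames to predict, and consecutive windows
    start `stride` frames apart. The convention is assumed, not taken
    verbatim from the paper."""
    windows = []
    total = context + horizon
    for start in range(0, num_frames - total + 1, stride):
        ctx = list(range(start, start + context))
        tgt = list(range(start + context, start + total))
        windows.append((ctx, tgt))
    return windows
```

For example, a 40-frame clip with 5 context frames, a 25-frame horizon, and a 5-frame stride yields three windows starting at frames 0, 5, and 10.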
Hardware Specification No The paper mentions "efficient performance on TPUs" and cites "Google cloud tpus. https://cloud.google.com/tpu/docs/tpus." However, it does not specify the exact model or version of the TPUs used for the experiments.
Software Dependencies No The paper describes architectural components like U-Net and Transformer, and references various models (WaveNet, DCTransformer), but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation.
Experiment Setup No The paper states: "Please see the appendix for model hyperparameters, training details, and additional ablation studies." This indicates that specific experimental setup details, including concrete hyperparameter values, are not provided in the main text.