Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Transframer: Arbitrary Frame Prediction with Generative Models

Authors: Charlie Nash, Joao Carreira, Jacob C Walker, Iain Barr, Andrew Jaegle, Mateusz Malinowski, Peter Battaglia

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We present a general-purpose framework for image modelling and vision tasks based on probabilistic frame prediction. Our approach unifies a broad range of tasks, from image segmentation to novel view synthesis and video interpolation. We pair this framework with an architecture we term Transframer, which uses U-Net and Transformer components to condition on annotated context frames, and outputs sequences of sparse, compressed image features. Transframer is the state-of-the-art on a variety of video generation benchmarks, is competitive with the strongest models on few-shot view synthesis, and can generate coherent 30-second videos from a single image without any explicit geometric information. A single generalist Transframer simultaneously produces promising results on 8 tasks, including semantic segmentation, image classification and optical flow prediction, with no task-specific architectural components, demonstrating that multi-task computer vision can be tackled using probabilistic image models. Our approach can in principle be applied to a wide range of applications that require learning the conditional structure of annotated image-formatted data.
Researcher Affiliation Academia The paper states "Anonymous authors Paper under double-blind review". Therefore, no institutional affiliations are provided to classify the author types.
Pseudocode No The paper describes the architecture and methodology using text and diagrams (Figure 2a and 2b), but it does not include any structured pseudocode or algorithm blocks.
Open Source Code No The paper does not provide any explicit statements about code release, links to a code repository, or mention of code in supplementary materials for the described methodology. A link to example outputs is provided, but not to the source code.
Open Datasets Yes We first evaluate our model on BAIR (Ebert et al., 2017), which is one of the most well-studied video modelling datasets, consisting of short video clips of a single robot arm interacting in the unconditional setting. Kinetics600 (Carreira et al., 2018a) is an action recognition dataset, consisting of video clips of dynamic human actions across 600 activities, such as sailing, chopping, and dancing. The KITTI dataset (Geiger et al., 2012) contains long video clips of roads and surrounding areas taken from the viewpoint of a car driver. To evaluate our model in the action-conditional setting, we use RoboNet (Dasari et al., 2019), which consists of short video clips of robot arms interacting with objects, and provides robot action annotations as 5-dimensional vectors. We evaluate our model on ShapeNet benchmarks, in particular the chair and car subsets used by Yu et al. (2021). The dataset consists of renders of 3D objects from the ShapeNet database. As a more realistic view-synthesis task, we train and evaluate on the Objectron dataset. Objectron consists of short, object-centered clips with full object and camera pose annotations. For the classification task, we use the ImageNet dataset (Deng et al., 2009) and consider predicting one-hot images, where each class is encoded as an 8x8 white patch on a black background. On some tasks, such as Cityscapes, the model produces qualitatively good outputs.
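The quoted passage describes classification labels rendered as "one-hot images": an 8x8 white patch on a black background. A minimal sketch of one plausible such encoding is below; the grid layout (patch position derived from the class index) and all names are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def class_to_onehot_image(class_idx, image_size=256, patch_size=8):
    """Encode a class label as an image: an 8x8 white patch on a black
    background. Mapping the class index to a grid cell position is an
    assumed convention, not one specified in the paper."""
    cells_per_row = image_size // patch_size  # e.g. 32 cells per row at 256x256
    row, col = divmod(class_idx, cells_per_row)
    img = np.zeros((image_size, image_size), dtype=np.uint8)
    img[row * patch_size:(row + 1) * patch_size,
        col * patch_size:(col + 1) * patch_size] = 255
    return img
```

Under this assumed layout, class 0 lights the top-left cell and class 33 the second cell of the second row; the label image can then be predicted by the same frame-prediction model as any other target frame.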
Dataset Splits Yes For consistency with previous work we only report test-set FVD for Kinetics600 and BAIR. We follow the evaluation regime of Villegas et al. (2017), operating at 64x64 resolution, and generate 25 frames given 5 context frames, with test clips taken from 3 longer test clips at 5-frame strides. At 64x64 resolution, our model improves on alternatives in every metric by a large margin. At 128x128, we could not find comparable previous work, so we report our results in Appendix E for future comparisons. We train at 64x64 and 128x128 resolutions, and evaluate using 2 context frames and 10 sampled frames on the test set specified by FitVid (Babaeizadeh et al., 2021). As in Yu et al. (2021), we evaluate using either 1 or 2 context views, and predict the remaining views in a 251-frame test set.
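The evaluation regime quoted above slices longer test clips into fixed windows: a few context frames followed by a prediction horizon, advancing at a fixed stride. A small sketch of that slicing follows; the function name and the exact windowing convention are assumptions for illustration.

```python
def evaluation_windows(num_frames, context=5, horizon=25, stride=5):
    """Return (context_indices, target_indices) pairs covering a clip of
    num_frames frames: each window has `context` conditioning frames
    followed by `horizon` frames to predict, and consecutive windows
    start `stride` frames apart. The convention is assumed, not taken
    verbatim from the paper."""
    windows = []
    total = context + horizon
    for start in range(0, num_frames - total + 1, stride):
        ctx = list(range(start, start + context))
        tgt = list(range(start + context, start + total))
        windows.append((ctx, tgt))
    return windows
```

For example, a 40-frame clip with 5 context frames, a 25-frame horizon, and a 5-frame stride yields three windows starting at frames 0, 5, and 10.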
Hardware Specification No The paper mentions "efficient performance on TPUs" and cites "Google cloud tpus. https://cloud.google.com/tpu/docs/tpus." However, it does not specify the exact model or version of the TPUs used for the experiments.
Software Dependencies No The paper describes architectural components like U-Net and Transformer, and references various models (WaveNet, DCTransformer), but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation.
Experiment Setup No The paper states: "Please see the appendix for model hyperparameters, training details, and additional ablation studies." This indicates that specific experimental setup details, including concrete hyperparameter values, are not provided in the main text.