Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DINO-Foresight: Looking into the Future with DINO
Authors: Efstathios Karypidis, Ioannis Kakogeorgiou, Spyridon Gidaris, Nikos Komodakis
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show the very strong performance, robustness and scalability of our framework. Experimental results demonstrate a unique advantage of our approach: our single model successfully handles multiple future-frame understanding tasks (semantic segmentation, instance segmentation, depth prediction, and surface normal prediction) where previous approaches required multiple specialized models. |
| Researcher Affiliation | Collaboration | 1Archimedes, Athena Research Center, Greece 2valeo.ai 3National Technical University of Athens 4University of Crete 5IACM-Forth |
| Pseudocode | No | The paper describes the methodology in prose, without explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page and code at https://dino-foresight.github.io/ |
| Open Datasets | Yes | Data. We assess our approach using the Cityscapes (Cordts et al., 2016) and nuScenes (Caesar et al., 2020) datasets, both offering video sequences from urban driving environments. |
| Dataset Splits | Yes | The Cityscapes dataset includes 2,975 training sequences, 500 for validation, each with 30 frames... The nuScenes dataset comprises of 700 training scenes and 150 validation scenes... For short-term prediction, the model uses frames 8, 11, 14, and 17 as context to predict frame 20 (with context length Nc = 4 and Np = 1). |
| Hardware Specification | Yes | Training is conducted on 8 A100 40Gb GPUs with an effective batch size of 64. |
| Software Dependencies | No | The paper mentions software like DINOv2, DPT, Mask2Former, Detectron2, and Adam optimizer but does not specify their version numbers or the Python/PyTorch versions used. |
| Experiment Setup | Yes | We use 12 layers with a hidden dimension of d = 1152 and sequence length N = 5 (with Nc = 4 context frames and Np = 1 future frame). For end-to-end training, we use the Adam optimizer (Kingma and Ba, 2015) with momentum parameters β1 = 0.9, β2 = 0.99, and a learning rate of 6.4 × 10⁻⁴ with cosine annealing. Training is conducted on 8 A100 40Gb GPUs with an effective batch size of 64. |
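
The short-term frame sampling and the learning-rate schedule quoted in the table can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the stride of 3 (inferred from the quoted frame indices 8, 11, 14, 17 and 20), the minimum learning rate of 0, the absence of warmup, and the `total_steps` value are all assumptions made for illustration; the base learning rate is read as 6.4 × 10⁻⁴.

```python
import math


def context_and_target(target_frame=20, num_context=4, stride=3):
    """Return (context frame indices, target frame index).

    Hypothetical helper: the stride of 3 is inferred from the quoted
    indices, not stated explicitly in the excerpt.
    """
    context = [target_frame - stride * k for k in range(num_context, 0, -1)]
    return context, target_frame


def cosine_lr(step, total_steps, base_lr=6.4e-4, min_lr=0.0):
    """Cosine-annealed learning rate at `step`.

    Assumptions: no warmup and a floor of `min_lr`; the excerpt gives
    neither the schedule length nor a minimum value.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(context_and_target())
    # total_steps=100 is an arbitrary illustrative schedule length.
    for step in (0, 50, 100):
        print(step, cosine_lr(step, total_steps=100))
```

With the defaults, `context_and_target()` reproduces the quoted short-term setting (Nc = 4 context frames, Np = 1 future frame), and `cosine_lr` decays from the base rate at step 0 toward the assumed floor at the final step.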