Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

DeltaFlow: An Efficient Multi-frame Scene Flow Estimation Method

Authors: Qingwen Zhang, Xiaomeng Zhu, Yushan Zhang, Yixi Cai, Olov Andersson, Patric Jensfelt

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations on the Argoverse 2, Waymo and nuScenes datasets show that ΔFlow achieves state-of-the-art performance with up to 22% lower error and 2× faster inference compared to the next-best multi-frame supervised method, while also demonstrating a strong cross-domain generalization ability.
Researcher Affiliation | Collaboration | Qingwen Zhang¹,✉, Xiaomeng Zhu¹,³, Yushan Zhang²,✉, Yixi Cai¹, Olov Andersson¹, Patric Jensfelt¹. ¹KTH Royal Institute of Technology, ²Linköping University, ³Scania CV AB. EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 Scheme Implementation
Open Source Code | Yes | The code is open-sourced at https://github.com/Kin-Zhang/DeltaFlow along with trained model weights.
Open Datasets | Yes | Experiments are conducted on three commonly used large-scale autonomous driving datasets in scene flow estimation: Argoverse 2 [39], which employs two roof-mounted 32-channel LiDARs; Waymo [33], which uses a single 64-channel LiDAR; and nuScenes [5], which uses a 32-channel LiDAR.
Dataset Splits | Yes | Argoverse 2 provides an official public scene flow challenge [1], consisting of 700 training and 150 validation scenes, each lasting 15 seconds at 10 Hz, totaling 110,071 point cloud frames. Waymo [15, 33] contains 798 training and 202 validation sequences, each recorded at 10 Hz for around 20 seconds. The training set consists of 155,000 frames. nuScenes [5] includes 700 training and 150 validation scenes, each recorded at 20 Hz for around 20 seconds. The training set contains 275,150 frames, of which 27,392 (≈10%) are annotated with ground-truth labels, yielding an effective annotation rate of 2 Hz.
Hardware Specification | Yes | The model is trained using the Adam optimizer [27], with a batch size of 20 across 10 NVIDIA 3080 GPUs for around 18 hours over 21 epochs. We used a batch size of 32, a fixed learning rate of 4 × 10⁻³, and trained on four NVIDIA A100 GPUs for all models. Runtime evaluations are conducted on a desktop system equipped with an Intel i7-12700KF processor and a single NVIDIA RTX 3090 GPU. The computations were enabled by the supercomputing resource Berzelius provided by National Supercomputer Centre at Linköping University and the Knut and Alice Wallenberg Foundation, Sweden.
Software Dependencies | No | We implement the 3D backbone MinkowskiNet18 network architecture in Backbone(·), using the spconv library, following the design in Fig. 4 of [9]. The detailed implementation of Decoder(·) is described below, following [44]. For leaderboard experiments, Argoverse 2 [39] test set results are directly obtained from the public leaderboard [1] to ensure a fair comparison. In the public leaderboard setting, evaluation is conducted within a 70 × 70 m area (or a 35 m perception range) around the ego vehicle. To align with this, ΔFlow is initially trained on a 76.8 × 76.8 m grid, corresponding to a 38.4 m perception range. The voxel grid size is 512 × 512 × 32 with voxel resolution set to (0.15, 0.15, 0.15) m in our best-performing configuration. The number of input frames is set to 5, with a time decay factor λ = 0.4. The model is trained using the Adam optimizer [27], with a batch size of 20 across 10 NVIDIA 3080 GPUs for around 18 hours over 21 epochs. We use a cosine decay learning rate schedule with a linear warmup. The learning rate reaches a target of 2 × 10⁻³ after the 2-epoch warmup phase and then decays to a minimum of 2 × 10⁻⁴.
Experiment Setup | Yes | The voxel grid size is 512 × 512 × 32 with voxel resolution set to (0.15, 0.15, 0.15) m in our best-performing configuration. The number of input frames is set to 5, with a time decay factor λ = 0.4. The model is trained using the Adam optimizer [27], with a batch size of 20 across 10 NVIDIA 3080 GPUs for around 18 hours over 21 epochs. We use a cosine decay learning rate schedule with a linear warmup. The learning rate reaches a target of 2 × 10⁻³ after the 2-epoch warmup phase and then decays to a minimum of 2 × 10⁻⁴. For Waymo and other local experiments, all baselines are retrained and reproduced under the same device settings to ensure consistent evaluation. To match default settings in prior methods, all models, including ours, are trained with a voxel resolution of 0.2 m, a spatial range of 51.2 m, a fixed total of 15 epochs, and the same training augmentation on the same computing cluster. We used a batch size of 32, a fixed learning rate of 4 × 10⁻³, and trained on four NVIDIA A100 GPUs for all models. For loss formulations, we assign category weights w_c = [1.0, 1.5, 2.0, 2.5] corresponding to the meta-categories c = [cars, other vehicles, pedestrians, VRUs] as defined by Argoverse 2 [39]. We also apply speed-dependent weights γ_b = [0.1, 0.4, 0.5] for static (v < 0.4 m/s), slow-moving (0.4 ≤ v < 1.0 m/s), and dynamic (v ≥ 1.0 m/s) objects.
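
To make the numbers in the Experiment Setup entry concrete, the following is a minimal Python sketch of two pieces it describes: the warmup-plus-cosine learning rate schedule (2-epoch warmup to 2 × 10⁻³, decay to 2 × 10⁻⁴ over 21 epochs) and the speed-dependent weight lookup. It is an illustration under stated assumptions, not the authors' implementation: the function names `lr_at` and `speed_weight` are hypothetical, the warmup is assumed to start from zero, and the schedule is evaluated per (fractional) epoch, none of which the quoted text specifies.

```python
import math

# Values quoted in the Experiment Setup entry.
LR_TARGET = 2e-3      # learning rate reached at the end of warmup
LR_MIN = 2e-4         # learning rate at the end of cosine decay
WARMUP_EPOCHS = 2
TOTAL_EPOCHS = 21


def lr_at(epoch: float) -> float:
    """Learning rate at a (possibly fractional) epoch.

    Assumption: warmup rises linearly from 0; the source only states
    the target value and the warmup length.
    """
    if epoch < WARMUP_EPOCHS:
        return LR_TARGET * epoch / WARMUP_EPOCHS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS), 1.0)
    # Cosine decay from LR_TARGET down to LR_MIN.
    return LR_MIN + 0.5 * (LR_TARGET - LR_MIN) * (1.0 + math.cos(math.pi * progress))


def speed_weight(v: float) -> float:
    """Speed-dependent weight for a point moving at v m/s.

    Buckets follow the quoted thresholds: static (v < 0.4),
    slow-moving (0.4 <= v < 1.0), dynamic (v >= 1.0).
    """
    if v < 0.4:
        return 0.1
    if v < 1.0:
        return 0.4
    return 0.5


if __name__ == "__main__":
    for e in (0, 1, 2, 10, 21):
        print(f"epoch {e:>2}: lr = {lr_at(e):.6f}")
    for v in (0.0, 0.5, 2.0):
        print(f"v = {v} m/s: weight = {speed_weight(v)}")
```

How these weights enter the loss, and whether the schedule is stepped per iteration or per epoch, is not stated in the entry above; the released code is the place to check.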