Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning
Authors: Manyi Yao, Bingbing Zhuang, Sparsh Garg, Amit K. Roy-Chowdhury, Christian R. Shelton, Manmohan Chandraker, Abhishek Aich
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on four public dash-cam video benchmarks show that iFinder's proposed grounding with domain-specific cues, especially object orientation and global context, significantly outperforms end-to-end V-VLMs on four zero-shot driving benchmarks, with up to 39% gains in accident reasoning accuracy. |
| Researcher Affiliation | Collaboration | Manyi Yao, Bingbing Zhuang, Sparsh Garg, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker, Abhishek Aich. NEC Laboratories, America; University of California, Riverside; University of California, San Diego |
| Pseudocode | No | The complete iFinder pipeline is as follows. The process begins with input video frames that are first undistorted to correct lens distortion. These frames are then processed by a suite of pretrained vision modules that extract critical driving cues: scene context, ego-vehicle motion, 2D/3D object detections, object tracking, lane assignments, object distances, and semantic attributes. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Releasing the code is contingent upon receiving approval from the author's organization. |
| Open Datasets | Yes | We choose datasets and baselines that require all the methods to analyze the complete video before answering the user query. To this end, we use four benchmarks: MM-AU (Multi-Modal Accident Video Understanding) [22], SUTD (Traffic Question Answering) [78], LingoQA [79], and Nexar [80] dataset. ... Table 9: Licenses and sources of datasets and models used in this work. MM-AU [22]: CC-BY-NC-4.0; SUTD-TrafficQA [78]: Custom (research-only, non-commercial); LingoQA [79]: https://github.com/wayveai/LingoQA (allowed for research); Nexar [80]: nexar-open-data-license |
| Dataset Splits | Yes | For accident occurrence prediction, we evaluate on the Nexar dataset [80]. Since the original test set does not include ground-truth labels, we randomly sample 100 videos from the training set to construct an evaluation set, maintaining a balanced distribution of 50 accident and 50 non-accident videos, consistent with the ratio in the full training set. |
| Hardware Specification | Yes | All experiments were conducted on a single NVIDIA A6000 GPU with 48 GB of memory. |
| Software Dependencies | No | In Step 1, we use GeoCalib [67] for estimating camera parameters and distortion coefficients. To correct the lens distortion, we use OpenCV's undistort [68] function for $R$. In Step 2, we use InternVL [69] for $F_{\text{I-VLM}}$ and VideoLLaMA2 [14] for $F_{\text{V-VLM}}$. In Step 3, we use DROID-SLAM [70] for $F_{\text{cam-pose}}$. In Step 4 for $F_{\text{2D-det}}$, we use OWL-V2 [71] and ByteTracker [72] for $F_{\text{2D-track}}$. In Step 5, we use OMR [73] for $F_{\text{lane}}$. In Step 6, Metric3D [74] is used for $F_{\text{depth}}$ and SAM [75] for $F_{\text{seg}}$. In Step 7, we again use InternVL for $F_{\text{I-VLM}}$. In Step 8, we use CenterTrack [76] for $F_{\text{3D-det}}$. For peer-informed reasoning, VideoLLaMA2 [14] serves as the default peer model unless otherwise specified. For final reasoning step, we use GPT-4o-mini [77] for $F_{\text{LLM}}$. |
| Experiment Setup | Yes | The full list of hyperparameters and prompts is in the Supplementary Material. ... In Step 3, we sample the temporal points in order to reduce noise and make the estimation insensitive to small deviations. Further, we set $\tau_a$ and $\tau_s$ as 30 and standard deviation of all speeds $\{s_t\}_{t=0}^{T}$. For motion estimation, we set $g$ as 2. In Step 4, since we use OWL-V2, we set the 2D classes as [motorcycle, police car, ambulance, bicycle, traffic light, stop sign, road sign, construction worker, police officer, ambulance, fire truck, construction vehicle, traffic cone, person, car, wheelchair, bus, truck] with confidence threshold as 0.25. In Step 5, we only estimate lane locations for vehicles and person categories. In Step 8, we use the default classes by CenterTrack [76] for nuScenes dataset [81]. All the rest of the parameters are set as default model choices. All the prompts are provided in Section E. |
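
The Dataset Splits row describes a 100-video Nexar evaluation set drawn at random from the training set, with 50 accident and 50 non-accident videos. A minimal sketch of how such a balanced sample could be drawn reproducibly is given below. It is not the authors' code: the function name `sample_balanced_eval_set`, the `video_id -> label` mapping, the fixed seed, and the synthetic `train_labels` stand-in are all assumptions made for illustration, and the quoted text does not state which seed or sampling routine was used.

```python
import random
from collections import Counter


def sample_balanced_eval_set(labels, per_class=50, seed=0):
    """Draw `per_class` video ids from each label group, without replacement.

    `labels` maps a video id to its class label (here 1 = accident,
    0 = non-accident). The seed is a hypothetical choice that only serves
    to make the draw repeatable.
    """
    rng = random.Random(seed)

    # Group video ids by their label.
    by_class = {}
    for video_id, label in labels.items():
        by_class.setdefault(label, []).append(video_id)

    selected = []
    for label in sorted(by_class):
        # Sort first so the draw does not depend on dict insertion order.
        candidates = sorted(by_class[label])
        if len(candidates) < per_class:
            raise ValueError(f"class {label!r} has only {len(candidates)} videos")
        selected.extend(rng.sample(candidates, per_class))
    return selected


# Synthetic stand-in for the training labels (not real Nexar data).
train_labels = {f"video_{i:04d}": i % 2 for i in range(1500)}

eval_ids = sample_balanced_eval_set(train_labels)
counts = Counter(train_labels[v] for v in eval_ids)
```

Recording the seed and the resulting list of video ids alongside the results would let others evaluate on the same subset, which the quoted split description alone does not make possible.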