GIM: Learning Generalizable Image Matcher From Internet Videos
Authors: Xuelun Shen, Zhipeng Cai, Wei Yin, Matthias Müller, Zijun Li, Kaixuan Wang, Xiaozhi Chen, Cheng Wang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate the effectiveness and generality of GIM. Applying GIM consistently improves the zero-shot performance of 3 state-of-the-art image matching architectures as the number of downloaded videos increases (Fig. 1 (a)); with 50 hours of YouTube videos, the relative zero-shot performance improves by 6.9% to 18.1%. |
| Researcher Affiliation | Collaboration | 1 Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, 361005, P.R. China; 2 Intel Labs; 3 DJI Technology |
| Pseudocode | No | The paper describes the GIM framework in text and with a diagram (Figure 2) but does not provide structured pseudocode or an algorithm block. |
| Open Source Code | Yes | The source code, a demo, and the benchmark are available at https://xuelunshen.com/gim. |
| Open Datasets | Yes | Given an architecture, GIM first trains it on standard domain-specific datasets (Li & Snavely, 2018; Dai et al., 2017). To experiment with commonly accessible data, we download 50 hours (hundreds of hours available) of tourism videos with the Creative Commons License from YouTube, covering 26 countries, 43 cities, various lighting conditions, dynamic objects and scene types. See Appendix D for details. |
| Dataset Splits | No | The paper discusses training on a mixture of in-domain data and video data, and evaluates on test sets. However, it does not provide specific details on validation dataset splits (e.g., percentages or sample counts for a validation set), or how hyperparameter tuning was performed with such a set. |
| Hardware Specification | Yes | It can process 12.5 hours of videos per day using 16 A100 GPUs, achieving a non-trivial performance boost for various state-of-the-art architectures. Given 8 A100 GPUs, it takes about 4, 5, and 7.5 days to train GIM-SuperGlue, GIM-LoFTR, and GIM-DKM respectively. For small-memory GPUs, we have tried 4 RTX 3090s. |
| Software Dependencies | No | The paper mentions using 'OpenCV to compute the homography matrix' and states that 'The training code and hyper-parameters of GIM strictly follow the original repositories of the individual architectures', implying reliance on the software stacks of those projects. However, it does not explicitly provide a list of specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions). |
| Experiment Setup | Yes | Unless otherwise stated, we use 50 hours of YouTube videos in all experiments, which provide roughly 180K pairs of training images. The training code and hyper-parameters of GIM strictly follow the original repositories of the individual architectures. For each video, we uniformly sample images every 20 frames to reduce redundancy. To obtain strong supervision signals, we propagate the correspondences as far as possible, as long as we have more than 1024 correspondences between two images. To experiment with various existing architectures, we apply the same loss used for domain-specific training to train the final GIM model, but only calculate the loss on the pixels with correspondences. Strong data augmentation: empirically, we find that strong data augmentations on video data provide better supervision signals (see Sec. 4.2 for the effect). Specifically, for each pair of video frames, we perform random perspective transformations beyond the standard augmentations used in existing methods (a minimal sketch of the sampling and perspective-warp steps follows the table). |
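
The sketch below illustrates two steps quoted in the Experiment Setup row: uniform sampling of every 20th frame and a random perspective warp applied as an extra augmentation. This is not the authors' code; the function names and the warp magnitude `max_shift` are illustrative assumptions, and only the 20-frame stride comes from the paper.

```python
# Minimal sketch of frame sampling and random perspective augmentation.
# Assumptions: function names and max_shift are hypothetical; the 20-frame
# stride is the value stated in the paper's experiment setup.
import cv2
import numpy as np

def sample_frames(video_path, stride=20):
    """Keep every `stride`-th frame of a video to reduce redundancy."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def random_perspective(image, max_shift=0.15, rng=None):
    """Warp an image with a random homography (magnitude is an assumption)."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = rng.uniform(-max_shift, max_shift, size=(4, 2)) * np.array([w, h])
    dst = (src + jitter).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    # The same homography H can be used to remap ground-truth correspondences
    # so the loss is still computed only on pixels with known matches.
    return cv2.warpPerspective(image, H, (w, h)), H
```

In practice, the returned homography would be composed with the propagated video correspondences so that supervision remains valid after the warp; how this composition is done in the released code is not detailed in the quoted text.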