GIM: Learning Generalizable Image Matcher From Internet Videos
Authors: Xuelun Shen, Zhipeng Cai, Wei Yin, Matthias Müller, Zijun Li, Kaixuan Wang, Xiaozhi Chen, Cheng Wang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate the effectiveness and generality of GIM. Applying GIM consistently improves the zero-shot performance of 3 state-of-the-art image matching architectures as the number of downloaded videos increases (Fig. 1 (a)); with 50 hours of YouTube videos, the relative zero-shot performance improves by 6.9% to 18.1%. |
| Researcher Affiliation | Collaboration | 1 Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, 361005, P.R. China; 2 Intel Labs; 3 DJI Technology |
| Pseudocode | No | The paper describes the GIM framework in text and with a diagram (Figure 2) but does not provide structured pseudocode or an algorithm block. |
| Open Source Code | Yes | The source code, a demo, and the benchmark are available at https://xuelunshen.com/gim. |
| Open Datasets | Yes | Given an architecture, GIM first trains it on standard domain-specific datasets (Li & Snavely, 2018; Dai et al., 2017). To experiment with commonly accessible data, we download 50 hours (hundreds of hours available) of tourism videos with the Creative Commons License from YouTube, covering 26 countries, 43 cities, various lighting conditions, dynamic objects and scene types. See Appendix D for details. |
| Dataset Splits | No | The paper discusses training on a mixture of in-domain data and video data, and evaluates on test sets. However, it does not provide specific details on validation dataset splits (e.g., percentages or sample counts for a validation set), or how hyperparameter tuning was performed with such a set. |
| Hardware Specification | Yes | It can process 12.5 hours of videos per day using 16 A100 GPUs, achieving a non-trivial performance boost for various state-of-the-art architectures. Given 8 A100 GPUs, it takes about 4, 5, and 7.5 days to train GIM-SuperGlue, GIM-LoFTR, and GIM-DKM respectively. For small-memory GPUs, we have tried 4 RTX 3090s. |
| Software Dependencies | No | The paper mentions using 'OpenCV to compute the homography matrix' and states that 'The training code and hyper-parameters of GIM strictly follow the original repositories of the individual architectures', implying reliance on the software stacks of those projects. However, it does not explicitly provide a list of specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions). |
| Experiment Setup | Yes | Unless otherwise stated, we use 50 hours of YouTube videos in all experiments, which provide roughly 180K pairs of training images. The training code and hyper-parameters of GIM strictly follow the original repositories of the individual architectures. For each video, we uniformly sample images every 20 frames to reduce redundancy. To obtain strong supervision signals, we propagate the correspondences as far as possible, as long as we have more than 1024 correspondences between two images. To experiment with various existing architectures, we apply the same loss used for domain-specific training to train the final GIM model, but only calculate the loss on the pixels with correspondences. Strong data augmentation: empirically, we find that strong data augmentations on video data provide better supervision signals (see Sec. 4.2 for the effect). Specifically, for each pair of video frames, we perform random perspective transformations beyond the standard augmentations used in existing methods (a minimal sketch of the sampling and perspective-warp steps follows the table). |
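
The sketch below illustrates two steps quoted in the Experiment Setup row: uniform sampling of every 20th frame and a random perspective warp applied as an extra augmentation. This is not the authors' code; the function names and the warp magnitude `max_shift` are illustrative assumptions, and only the 20-frame stride comes from the paper.

```python
# Minimal sketch of frame sampling and random perspective augmentation.
# Assumptions: function names and max_shift are hypothetical; the 20-frame
# stride is the value stated in the paper's experiment setup.
import cv2
import numpy as np

def sample_frames(video_path, stride=20):
    """Keep every `stride`-th frame of a video to reduce redundancy."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def random_perspective(image, max_shift=0.15, rng=None):
    """Warp an image with a random homography (magnitude is an assumption)."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = rng.uniform(-max_shift, max_shift, size=(4, 2)) * np.array([w, h])
    dst = (src + jitter).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    # The same homography H can be used to remap ground-truth correspondences
    # so the loss is still computed only on pixels with known matches.
    return cv2.warpPerspective(image, H, (w, h)), H
```

In practice, the returned homography would be composed with the propagated video correspondences so that supervision remains valid after the warp; how this composition is done in the released code is not detailed in the quoted text.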