CATs: Cost Aggregation Transformers for Visual Correspondence
Authors: Seokju Cho, Sunghwan Hong, Sangryul Jeon, Yunsung Lee, Kwanghoon Sohn, Seungryong Kim
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments to demonstrate the effectiveness of the proposed model over the latest methods and provide extensive ablation studies. Code and trained models are available at https://sunghwanhong.github.io/CATs/. |
| Researcher Affiliation | Academia | Seokju Cho (Yonsei University), Sunghwan Hong (Korea University), Sangryul Jeon (Yonsei University), Yunsung Lee (Korea University), Kwanghoon Sohn (Yonsei University), Seungryong Kim (Korea University) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and trained models are available at https://sunghwanhong.github.io/CATs/. |
| Open Datasets | Yes | SPair-71k [38] provides total 70,958 image pairs... we also consider PF-PASCAL [12] containing 1,351 image pairs from 20 categories and PF-WILLOW [11] containing 900 image pairs from 4 categories, each dataset providing corresponding ground-truth annotations. |
| Dataset Splits | No | The paper specifies training and test splits for the datasets (e.g., 'we train our network on the training split and evaluated on the test split'), but does not explicitly describe a validation split. |
| Hardware Specification | Yes | For a fair comparison, the results are obtained using a single NVIDIA GeForce RTX 2080 Ti GPU and Intel Core i7-10700 CPU. |
| Software Dependencies | No | The paper states 'We implemented our network using PyTorch [40]', but does not provide specific version numbers for PyTorch or other software dependencies. |
| Experiment Setup | Yes | For the hyper-parameters for Transformer encoder, we set the depth as 1 and the number of heads as 6. We resize the spatial size of the input image pairs to 256×256 and a sequence of selected features are resized to 16×16. We use a learnable positional embedding [10], instead of fixed [61]. We implemented our network using PyTorch [40], and AdamW [33] optimizer with an initial learning rate of 3e-5 for the CATs layers and 3e-6 for the backbone features are used, which we gradually decrease during training. |
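The two-learning-rate optimizer setup quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: the function and variable names are hypothetical, and since the paper only says the rates are "gradually decreased", the step-decay schedule here is an assumption.

```python
def make_param_groups(cats_params, backbone_params):
    """Two parameter groups with the initial learning rates reported in
    the paper: 3e-5 for the CATs (cost aggregation) layers and 3e-6 for
    the backbone features. With PyTorch, this list would be passed to
    torch.optim.AdamW(...)."""
    return [
        {"params": cats_params, "lr": 3e-5},      # CATs layers
        {"params": backbone_params, "lr": 3e-6},  # backbone features
    ]

def decayed_lr(base_lr, step, decay=0.5, every=1000):
    """Illustrative step decay: halve the rate every 1000 steps.
    The paper does not specify the actual decay schedule."""
    return base_lr * (decay ** (step // every))
```

For example, `decayed_lr(3e-5, 2000)` returns the CATs-layer rate after two decay intervals under this assumed schedule.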