Activity Image-to-Video Retrieval by Disentangling Appearance and Motion
Authors: Liu Liu, Jiangtong Li, Li Niu, Ruicong Xu, Liqing Zhang (pp. 2145-2153)
AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our MAP-IVR approach remarkably outperforms the state-of-the-art approaches on two benchmark activity-based video datasets. |
| Researcher Affiliation | Collaboration | Liu Liu,¹ Jiangtong Li,¹ Li Niu,*¹ Ruicong Xu,² Liqing Zhang¹ (¹ MoE Key Lab of Artificial Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University; ² Meituan) |
| Pseudocode | No | The paper includes a flowchart (Figure 2) and mathematical equations, but does not contain any blocks explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper mentions that the model is implemented in PyTorch and that supplementary material exists for a significance test, but it does not contain an explicit statement about releasing the source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | based on two public activity video datasets THUMOS14 (Jiang et al. 2014) and ActivityNet (Heilbron et al. 2015). Dataset URLs: https://www.crcv.ucf.edu/THUMOS14/home.html and http://activity-net.org/ |
| Dataset Splits | Yes | For THUMOS14, we obtain 7028 image-video pairs in total, and then divide them into 5614 training pairs and 1414 test pairs, where the test pairs exclude the validation videos because the validation set is used for finetuning R-C3D model. For ActivityNet, we obtain 4739 image-video pairs in total, which are divided into 3790 training pairs and 949 test pairs. (See the split sketch after the table.) |
| Hardware Specification | Yes | Our model is implemented by PyTorch 1.4 (Paszke et al. 2019) on Ubuntu 16.04 and trained on a single GTX 1080Ti GPU. |
| Software Dependencies | Yes | Our model is implemented by PyTorch 1.4 (Paszke et al. 2019) on Ubuntu 16.04 and trained on a single GTX 1080Ti GPU. |
| Experiment Setup | Yes | During training, we choose Adam (Kingma and Ba 2015) with learning rate 1×10⁻⁴ and set batch size as 32 for 60 epochs. Additionally, we set λo as 1. While retrieving, we set λv as 0.5. Besides, we sample 25 motion uncertainty codes z for the retrieval in video feature space (i.e., h = 25). All the hyper-parameters are set via cross-validation. (See the configuration sketch after the table.) |
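The split counts in the Dataset Splits row can be sanity-checked directly, and the THUMOS14 rule of keeping validation videos out of the test set is easy to express in code. Below is a minimal sketch, assuming hypothetical pair lists and an `is_validation_video` predicate; the paper does not release its split files, so everything except the reported counts is illustrative.

```python
import random

def split_pairs(pairs, n_train, is_validation_video, seed=0):
    """Split image-video pairs into train/test. Pairs drawn from validation
    videos are forced into training, since the paper's test pairs exclude
    validation videos (THUMOS14 validation was used to finetune R-C3D)."""
    rng = random.Random(seed)
    pairs = pairs[:]
    rng.shuffle(pairs)
    forced = [p for p in pairs if is_validation_video(p)]      # must train on these
    rest = [p for p in pairs if not is_validation_video(p)]
    n_rest = max(0, n_train - len(forced))
    return forced + rest[:n_rest], rest[n_rest:]               # train, test

# Sanity-check the counts reported in the paper:
assert 5614 + 1414 == 7028  # THUMOS14: train + test = total pairs
assert 3790 + 949 == 4739   # ActivityNet: train + test = total pairs

# Toy usage with hypothetical (image, video) pairs:
demo = [("img%d" % i, "vid%d" % i) for i in range(10)]
train, test = split_pairs(demo, n_train=8,
                          is_validation_video=lambda p: p[1] == "vid0")
assert len(train) == 8 and len(test) == 2 and ("img0", "vid0") in train
```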
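The Experiment Setup row lists all of the reported hyper-parameters. The sketch below wires them into a generic PyTorch training loop so the numbers are concrete; the model, data, and loss are stand-ins (a linear projection over dummy features with an L2 objective), since the MAP-IVR code was not released and its actual disentanglement objective is defined only in the paper.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hyper-parameters as reported in the paper (set via cross-validation).
LR = 1e-4            # Adam learning rate (1×10⁻⁴)
BATCH_SIZE = 32
NUM_EPOCHS = 60
LAMBDA_O = 1.0       # λo, a loss weight set to 1 during training
LAMBDA_V = 0.5       # λv, used at retrieval time only (score combination)
NUM_Z_SAMPLES = 25   # h = 25 motion uncertainty codes z, sampled at retrieval

# Stand-in model and data: the real MAP-IVR architecture and features are
# not public, so a linear projection over random features keeps this runnable.
model = nn.Linear(2048, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

dummy_pairs = TensorDataset(torch.randn(128, 2048), torch.randn(128, 512))
train_loader = DataLoader(dummy_pairs, batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(NUM_EPOCHS):
    for image_feat, video_feat in train_loader:   # batches of 32 image-video pairs
        # Placeholder objective weighted by λo; the paper's actual loss
        # disentangles appearance and motion, which an L2 match only mimics.
        loss = LAMBDA_O * torch.nn.functional.mse_loss(model(image_feat), video_feat)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# λv and the h = 25 sampled motion codes enter only at retrieval time,
# when scoring candidate videos for a query image; that step is not shown here.
```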