CoVR: Learning Composed Video Retrieval from Web Video Captions
Authors: Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. |
| Researcher Affiliation | Academia | 1 LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, France; 2 Inria, ENS, CNRS, PSL Research University, France. lucas.ventura@enpc.fr |
| Pseudocode | No | The paper describes methods in prose and with figures, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr. |
| Open Datasets | Yes | We apply our triplet generation approach to the WebVid2M dataset (Bain et al. 2021) which contains 2.5M Web-scraped video-caption pairs. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr. |
| Dataset Splits | Yes | The first dataset is divided into training, validation, and testing splits with 28225/16742, 4181/2265 and 4148/2178 queries/images, respectively. The second is divided into training and validation splits with 18000/45429 and 6016/15415 queries/images, respectively. |
| Hardware Specification | Yes | Experiments are conducted on 4 NVIDIA A100-SXM4-80GB GPUs. |
| Software Dependencies | No | The paper mentions models used (LLaMA 7B, BLIP, ViT-L), but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We train our CoVR model on WebVid-CoVR for 4 epochs with a batch size of 2048 and an initial learning rate of 1e-5. To finetune on CIRR/FashionIQ, we train for 6 epochs with a batch size of 2048/1024 and an initial learning rate of 1e-4. We set hyperparameters based on the validation curve of WebVid-CoVR. |