Artemis: Towards Referential Understanding in Complex Videos
Authors: Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, Yunjie Tian
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results are promising both quantitatively and qualitatively. Additionally, we show that Artemis can be integrated with video grounding and text summarization tools to understand more complex scenarios. Code and data are available at https://github.com/qiujihao19/Artemis. |
| Researcher Affiliation | Academia | 1University of Chinese Academy of Sciences 2University at Buffalo |
| Pseudocode | No | Not found. The paper describes the methodology in text and provides a framework diagram in Figure 2, but no explicit pseudocode or algorithm blocks are present. |
| Open Source Code | Yes | Code and data are available at https://github.com/qiujihao19/Artemis. |
| Open Datasets | Yes | We collect video data for referential understanding from 7 datasets, including HC-STVG [44], VIDSentence [10], A2D Sentences [20], LaSOT [18], MeViS [16], GOT10K [24], and MGIT [23]. |
| Dataset Splits | Yes | The validation portion, containing 3,400 video clips, evaluates Artemis’s ability. |
| Hardware Specification | Yes | This efficient design requires only 28 hours (3 hours for the final stage) on 8 NVIDIA-A800 GPUs. |
| Software Dependencies | No | The paper mentions specific models like Vicuna-7B v1.5 and CLIP ViT-L/14, and optimizer AdamW, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The training procedure of Artemis comprises three steps: (1) video-text pre-training, (2) video-based instruction tuning, and (3) video-based referring. We report the detailed training hyper-parameters of Artemis in Table 6. Table 6 includes: Peak learning rate (1e-3, 2e-5, 4e-5), LoRA rank (16), Image resolution (224), Global batch size (256, 128, 48), Numerical precision (bfloat16, float16), etc. |
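The reported hyper-parameters can be gathered into a small configuration sketch. This is not the authors' code; the stage names are illustrative, and it assumes the listed values (learning rates 1e-3/2e-5/4e-5, batch sizes 256/128/48) map to the three training stages in the order quoted above:

```python
# Hedged sketch of Artemis's three-stage training setup, as quoted from Table 6.
# Stage keys and field names are hypothetical; values are from the paper's report.
stages = {
    "video_text_pretraining":   {"peak_lr": 1e-3, "global_batch_size": 256},
    "video_instruction_tuning": {"peak_lr": 2e-5, "global_batch_size": 128},
    "video_based_referring":    {"peak_lr": 4e-5, "global_batch_size": 48},
}

# Settings reported once for the whole pipeline.
shared = {
    "lora_rank": 16,          # LoRA rank from Table 6
    "image_resolution": 224,  # input resolution (matches CLIP ViT-L/14)
}

for name, cfg in stages.items():
    print(f"{name}: lr={cfg['peak_lr']}, batch={cfg['global_batch_size']}")
```

Note that the report flags the paper as not pinning software versions (Python, PyTorch, CUDA), so any attempt to reproduce these stages would still require guessing the framework versions.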