Artemis: Towards Referential Understanding in Complex Videos

Authors: Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, Yunjie Tian

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Results are promising both quantitatively and qualitatively. Additionally, we show that Artemis can be integrated with video grounding and text summarization tools to understand more complex scenarios. Code and data are available at https://github.com/qiujihao19/Artemis.
Researcher Affiliation | Academia | (1) University of Chinese Academy of Sciences; (2) University at Buffalo
Pseudocode | No | Not found. The paper describes the methodology in text and provides a framework diagram in Figure 2, but no explicit pseudocode or algorithm blocks are present.
Open Source Code | Yes | Code and data are available at https://github.com/qiujihao19/Artemis.
Open Datasets | Yes | We collect video data for referential understanding from 7 datasets, including HC-STVG [44], VIDSentence [10], A2D Sentences [20], LaSOT [18], MeViS [16], GOT10K [24], and MGIT [23].
Dataset Splits | Yes | The validation portion, containing 3,400 video clips, evaluates Artemis’s ability.
Hardware Specification | Yes | This efficient design requires only 28 hours (3 hours for the final stage) on 8 NVIDIA-A800 GPUs.
Software Dependencies | No | The paper mentions specific models such as Vicuna-7B v1.5 and CLIP ViT-L/14, and the AdamW optimizer, but does not provide version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | The training procedure of Artemis comprises three steps: (1) video-text pre-training, (2) video-based instruction tuning, and (3) video-based referring. We report the detailed training hyper-parameters of Artemis in Table 6. Table 6 includes: peak learning rate (1e-3, 2e-5, 4e-5), LoRA rank (16), image resolution (224), global batch size (256, 128, 48), numerical precision (bfloat16, float16), etc.
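For readers reconstructing the setup, the quoted Table 6 values can be collected into a per-stage configuration. The sketch below is a minimal Python summary, not code from the released repository: it assumes the listed learning rates and batch sizes map to the three training stages in the order given, and the assignment of bfloat16 versus float16 to specific stages is a guess, since the quoted row does not say which stage uses which precision.

```python
# Hypothetical per-stage training configuration assembled from the Table 6
# values quoted above. Stage-to-value assignment is an assumption, not taken
# from the paper or the released code.
from dataclasses import dataclass


@dataclass
class StageConfig:
    name: str
    peak_lr: float            # peak learning rate (optimizer: AdamW)
    global_batch_size: int
    precision: str            # numerical precision; bfloat16/float16 split is assumed
    lora_rank: int = 16       # LoRA rank, shared across stages per Table 6
    image_resolution: int = 224  # input resolution of the CLIP ViT-L/14 encoder


# Assumed to follow the order of values quoted from Table 6:
# learning rates (1e-3, 2e-5, 4e-5) and global batch sizes (256, 128, 48).
ARTEMIS_STAGES = [
    StageConfig("video-text pre-training", peak_lr=1e-3, global_batch_size=256, precision="bfloat16"),
    StageConfig("video-based instruction tuning", peak_lr=2e-5, global_batch_size=128, precision="bfloat16"),
    StageConfig("video-based referring", peak_lr=4e-5, global_batch_size=48, precision="float16"),
]

if __name__ == "__main__":
    for stage in ARTEMIS_STAGES:
        print(stage)
```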