Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Youku Dense Caption: A Large-scale Chinese Video Dense Caption Dataset and Benchmarks
Authors: Zixuan Xiong, Guangwei Xu, Wenkai Zhang, Yuan Miao, Xuan Wu, LinHai, Ruijie Guo, Hai-Tao Zheng
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments and evaluations are conducted on existing state-of-the-art multi-modal models, demonstrating the dataset's utility and the potential for further research. The dataset is publicly available at https://github.com/OpenSearch-AI/Youku-DenseCaption. We validate the effectiveness of the dataset through extensive experiments, demonstrating its significant impact on enhancing the performance of multi-modal model generation and retrieval. ... 5 EXPERIMENTS ... 6 ABLATION STUDY |
| Researcher Affiliation | Collaboration | 1 Shenzhen International Graduate School, Tsinghua University; 2 Pengcheng Laboratory; 3 Alibaba Cloud Computing; 4 Alibaba Group |
| Pseudocode | Yes | Algorithm 1 Cross-Video PRVR Benchmark Setup |
| Open Source Code | Yes | The dataset is publicly available at https://github.com/OpenSearch-AI/Youku-DenseCaption. ... The authors will handle the long-term maintenance of the Youku Dense Caption dataset and the benchmarks evaluated in our paper. |
| Open Datasets | Yes | To address this gap within the Chinese community and to promote the advancement of Chinese multi-modal models, we develop the first, large-scale, and high-quality Chinese dense video captioning dataset, named Youku Dense Caption. ... The dataset is publicly available at https://github.com/OpenSearch-AI/Youku-DenseCaption. |
| Dataset Splits | Yes | When dividing benchmark data and non-benchmark data, we fixed the random seed to 42 and randomly selected 10% of the video IDs from the Youku Dense Caption dataset as the source for creating benchmark data. This subset consists of 3,185 videos and their corresponding 31,553 annotations. The remaining 90% of the data, comprising 28,281 videos and 280,368 annotations, is used as the source for training data. ... In the end, we obtain 1,872 videos and a total of 20,099 annotations, averaging 10.73 annotations per video. ... The text2video part was filtered from the original 31,553 texts down to 28,988 texts for retrieval evaluation, with an average of 4.96 related video IDs per text. |
| Hardware Specification | Yes | In the data augmentation experiments for the generation task, we train the data for 1 epoch using the default fine-tuning parameters of the Swift framework. The learning rate is set to 1e-4, and the batch size is set to 16. The training is conducted on eight A800 GPUs, resulting in an effective batch size of 128. |
| Software Dependencies | No | The paper mentions software components and models like InternVideo2-Chat-8B, Mistral-7B, MiniCPM-V-2.6, Qwen2-7B, InternVL2-8B, InternViT-6B, InternLM2.5-7B, Qwen2-VL-7B-Instruct, DFN, Swift framework, Moment-DETR, but does not specify version numbers for these software components or frameworks. |
| Experiment Setup | Yes | In the data augmentation experiments for the generation task, we train the data for 1 epoch using the default fine-tuning parameters of the Swift framework. The learning rate is set to 1e-4, and the batch size is set to 16. The training is conducted on eight A800 GPUs, resulting in an effective batch size of 128. In the data augmentation experiments for the retrieval task, we train the data for 10 epochs with a learning rate of 8e-4. The batch size per GPU is set to 256, and the training is conducted on eight A800 GPUs, resulting in an effective batch size of 2048. |
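The Dataset Splits row describes a deterministic split: fix the random seed to 42 and sample 10% of video IDs as benchmark data, leaving the rest for training. A minimal sketch of that procedure follows; the function name and structure are illustrative assumptions, not taken from the released repository.

```python
import random

def split_video_ids(video_ids, benchmark_fraction=0.1, seed=42):
    """Split video IDs into (benchmark, train) lists.

    Mirrors the paper's description: with a fixed seed, randomly select
    10% of the video IDs as the benchmark source; the remaining 90% form
    the training source. This is an illustrative sketch, not the
    authors' actual splitting code.
    """
    rng = random.Random(seed)                 # fixed seed => reproducible split
    ids = sorted(video_ids)                   # sort for a deterministic base order
    n_benchmark = round(len(ids) * benchmark_fraction)
    benchmark = sorted(rng.sample(ids, n_benchmark))
    benchmark_set = set(benchmark)
    train = [v for v in ids if v not in benchmark_set]
    return benchmark, train
```

Because both the seed and the base ordering are fixed, repeated runs yield the same benchmark/train partition, which is what makes the 3,185 / 28,281 video split reportable as a stable number.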
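The Experiment Setup row reports effective batch sizes implied by per-GPU batch size and GPU count (16 × 8 = 128 for generation; 256 × 8 = 2048 for retrieval). A small helper makes that arithmetic explicit; the gradient-accumulation parameter is an assumption for generality and defaults to 1, matching the quoted setup.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum: int = 1) -> int:
    """Effective batch size = per-GPU batch x number of GPUs x accumulation steps."""
    return per_gpu_batch * num_gpus * grad_accum

# Generation task: batch size 16 on eight A800 GPUs -> 128
# Retrieval task: batch size 256 per GPU on eight A800 GPUs -> 2048
```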