Dense Events Grounding in Video
Authors: Peijun Bao, Qian Zheng, Yadong Mu
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments on large-scale datasets ActivityNet Captions and TACoS. |
| Researcher Affiliation | Academia | Peijun Bao¹, Qian Zheng², Yadong Mu¹*; ¹Peking University, China; ²Nanyang Technological University, Singapore |
| Pseudocode | No | The paper describes the network architecture and components in text and diagrams but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | Yes | ActivityNet Captions (Krishna et al. 2017) consists of 19,209 untrimmed videos. ... TACoS (Regneri et al. 2013) consists of 127 videos. |
| Dataset Splits | Yes | For a fair comparison, following the experimental setting in single sentence grounding (Zhang et al. 2020; Yuan et al. 2019), we use val_1 as the validation set and val_2 as the testing set. There are 37,417, 17,505, and 17,031 moment-sentence pairs in the training, validation, and testing sets, respectively. ... Following the standard data splitting, there are in total 10,146, 4,589, and 4,083 moment-sentence pairs in the training, validation, and testing sets, respectively. |
| Hardware Specification | No | The paper mentions using 'pretrained CNN (Tran et al. 2015)' for feature extraction but does not specify any hardware details (e.g., GPU/CPU models, memory, or cloud instances) used for running its own experiments. |
| Software Dependencies | No | The paper mentions using 'GloVe word embedding', 'LSTM', and 'Adam' but does not specify any software versions for libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python). |
| Experiment Setup | Yes | During training, we use Adam (Kingma and Ba 2014) with a learning rate of 1 × 10⁻⁴, momentum of 0.9, and batch size of 4 as the optimization algorithm. ... The channel numbers of the sentence feature and video proposal feature, d_S and d_V, are all set to 512. We set the dimension of the positional feature d_pos to 128 and the size of the compact set n to 512. The number of sampled clips N is set to 32 and 64 for ActivityNet Captions and TACoS, respectively. For BM operations in the video encoder, we set the sampling number of each proposal to 16 and 32 for ActivityNet Captions and TACoS, respectively. ... For the binary cross-entropy loss, the scaling thresholds µ_min and µ_max are set to 0.5 and 1.0 for ActivityNet Captions and 0.3 and 0.7 for TACoS. |
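
For readers attempting to replicate the reported setup, the hyperparameters in the Experiment Setup row map directly onto an optimizer configuration. Below is a minimal sketch, assuming a PyTorch implementation (the paper does not name its framework, and no code is released); the placeholder model and the linear-clamp form of the scaled-IoU target are illustrative assumptions, not the authors' method.

```python
import torch

# Placeholder network; the actual dense-events grounding model is not released.
model = torch.nn.Linear(512, 512)

# Adam with learning rate 1 × 10^-4; the quoted "momentum of 0.9" is read here
# as Adam's beta_1 coefficient, which is an assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
batch_size = 4  # batch size reported in the paper


def scaled_iou_target(iou: float, mu_min: float, mu_max: float) -> float:
    """Rescale a proposal's IoU into a [0, 1] target for the binary
    cross-entropy loss, using the scaling thresholds quoted above
    (mu_min=0.5, mu_max=1.0 for ActivityNet Captions; 0.3, 0.7 for TACoS).
    The linear-clamp form is an assumption, borrowed from 2D map-based
    grounding methods such as 2D-TAN (Zhang et al. 2020)."""
    return min(max((iou - mu_min) / (mu_max - mu_min), 0.0), 1.0)
```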