Dense Events Grounding in Video

Authors: Peijun Bao, Qian Zheng, Yadong Mu

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct comprehensive experiments on large-scale datasets ActivityNet Captions and TACoS."
Researcher Affiliation | Academia | Peijun Bao (1), Qian Zheng (2), Yadong Mu (1)*; (1) Peking University, China; (2) Nanyang Technological University, Singapore
Pseudocode | No | The paper describes the network architecture and components in text and diagrams but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing source code, nor a link to a code repository.
Open Datasets | Yes | "ActivityNet Captions (Krishna et al. 2017) consists of 19,209 untrimmed videos. ... TACoS (Regneri et al. 2013) consists of 127 videos."
Dataset Splits | Yes | "For a fair comparison, following the experimental setting in single-sentence grounding (Zhang et al. 2020; Yuan et al. 2019), we use val_1 as the validation set and val_2 as the testing set. There are 37,417, 17,505, and 17,031 moment-sentence pairs in the training, validation, and testing sets, respectively. ... Following the standard data splitting, there are in total 10,146, 4,589, and 4,083 moment-sentence pairs in the training, validation, and testing sets, respectively."
Hardware Specification | No | The paper mentions using a 'pretrained CNN (Tran et al. 2015)' for feature extraction but does not specify any hardware details (e.g., GPU/CPU models, memory, or cloud instances) used for running its own experiments.
Software Dependencies | No | The paper mentions using 'GloVe word embeddings', 'LSTM', and 'Adam' but does not specify versions for any libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python).
Experiment Setup | Yes | "During training, we use Adam (Kingma and Ba 2014) with a learning rate of 1e-4, momentum of 0.9, and batch size of 4 as the optimization algorithm. ... The channel numbers of the sentence feature and video proposal feature, d_S and d_V, are both set to 512. We set the dimension of the positional feature d_pos to 128 and the size of the compact set n to 512. The number of sampled clips N is set to 32 and 64 for ActivityNet Captions and TACoS, respectively. For BM operations in the video encoder, we set the sampling number of each proposal to 16 and 32 for ActivityNet Captions and TACoS, respectively. ... For the binary cross-entropy loss, the scaling thresholds mu_min and mu_max are set to 0.5 and 1.0 for ActivityNet Captions and 0.3 and 0.7 for TACoS."
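Since the paper releases no code, the reported hyperparameters can only be restated, not verified. The sketch below collects them into per-dataset config dicts; all names (`build_config`, the key names) are our own and hypothetical, chosen only to mirror the values quoted in the Experiment Setup row.

```python
# Hypothetical reproduction aid: the hyperparameters reported in the paper,
# split into shared settings and dataset-specific overrides. Only the values
# come from the paper; the structure and names are assumptions.

COMMON = {
    "optimizer": "Adam",      # Kingma and Ba 2014
    "learning_rate": 1e-4,
    "momentum": 0.9,
    "batch_size": 4,
    "d_sentence": 512,        # d_S, sentence feature channels
    "d_video": 512,           # d_V, video proposal feature channels
    "d_positional": 128,      # d_pos, positional feature dimension
    "compact_set_size": 512,  # n, size of the compact set
}

PER_DATASET = {
    "ActivityNet Captions": {
        "num_clips": 32,      # N, sampled clips
        "bm_samples": 16,     # sampling number per proposal (BM ops)
        "mu_min": 0.5,        # BCE scaling thresholds
        "mu_max": 1.0,
    },
    "TACoS": {
        "num_clips": 64,
        "bm_samples": 32,
        "mu_min": 0.3,
        "mu_max": 0.7,
    },
}

def build_config(dataset: str) -> dict:
    """Merge shared settings with the overrides for one dataset."""
    cfg = dict(COMMON)
    cfg.update(PER_DATASET[dataset])
    return cfg

cfg = build_config("TACoS")
print(cfg["num_clips"], cfg["mu_min"], cfg["mu_max"])
```

A dict like this would make the paper's two experimental configurations directly comparable and easy to feed into a training script, which is exactly the gap the missing code release leaves open.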