Further Understanding Videos through Adverbs: A New Video Task

Authors: Bo Pang, Kaiwen Zha, Yifan Zhang, Cewu Lu

AAAI 2020, pp. 11823-11830 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct comprehensive experiments to show the challenge of BA recognition and evaluate our BAUN. Results reveal that: 1) BA recognition is a challenging task for current video understanding models; 2) BAUN enjoys accuracy gains from its elaborate structure and is substantially better than the 3D CNN model; 3) the two semantics, BA and action, propel each other toward better performance. We evaluate our BAUN on both BA and action recognition tasks.
Researcher Affiliation | Academia | Shanghai Jiao Tong University, {pangbo, kevin_zha, zhangyf_sjtu, lucewu}@sjtu.edu.cn
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states that the dataset will be released, but it does not provide concrete access to source code (a specific repository link, an explicit code-release statement, or code in supplementary materials) for the methodology described in this paper.
Open Datasets | Yes | To exhaustively decode this semantics, we construct the Videos with Action and Adverb Dataset (VAAD), a large-scale dataset with a semantically complete set of BAs. The dataset will be released to the public with this paper. The data is collected from existing datasets, including Kinetics, UCF101, and HMDB51. We conduct BA recognition on our VAAD, which is the only dataset with BA semantics, and on top of VAAD we also evaluate our BAUN on action datasets: UCF101 (Soomro, Zamir, and Shah 2012), HMDB-51 (Kuehne et al. 2011), and Kinetics (Kay et al. 2017).
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or a detailed splitting methodology) needed to reproduce the data partitioning. It mentions a 'training and validation set' in general terms but gives no split ratios.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed machine specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers, such as Python 3.8 or CPLEX 12.4) needed to replicate the experiment.
Experiment Setup | Yes | When training on VAAD, 64 clips are fed into the network in each iteration. For the ConvNet-LSTM and Two-Stream models, we use ResNet-50 pre-trained on ImageNet as the backbone and the Adam optimizer (Kingma and Ba 2014) with the learning rate initialized at 1e-4 and decreased to 1e-5 after 8 epochs, while for I3D and BAUN we use an ImageNet pre-trained backbone and SGD with the learning rate initialized at 1e-2 and decreased to 1e-3 after 6 epochs. For our BAUN, we pre-train the 3D convolution network on ImageNet and fine-tune it on VAAD with the STSM.
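
Since no code is released, the sketch below is a minimal PyTorch rendering of only the optimization settings quoted above (64 clips per iteration, the two optimizer recipes, and their learning-rate drops). The MultiStepLR scheduler, the torchvision ResNet-50 stand-in, the make_optimizer helper, and its family tags are illustrative assumptions, not the authors' implementation; total epoch counts are not reported in the paper.

import torch
import torchvision

BATCH_SIZE = 64  # clips fed into the network per iteration on VAAD (reported)

def make_optimizer(model, family):
    # Hypothetical helper; `family` tags the two training recipes the paper reports.
    if family in ("convnet_lstm", "two_stream"):
        # Adam, lr 1e-4 decayed to 1e-5 after 8 epochs (reported)
        opt = torch.optim.Adam(model.parameters(), lr=1e-4)
        sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[8], gamma=0.1)
    elif family in ("i3d", "baun"):
        # SGD, lr 1e-2 decayed to 1e-3 after 6 epochs (reported)
        opt = torch.optim.SGD(model.parameters(), lr=1e-2)
        sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[6], gamma=0.1)
    else:
        raise ValueError(f"unknown model family: {family}")
    return opt, sched

# Example: an ImageNet pre-trained ResNet-50 stands in for the 2D backbone
# of the ConvNet-LSTM and Two-Stream baselines.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
optimizer, scheduler = make_optimizer(backbone, "convnet_lstm")

for epoch in range(10):  # epoch count is illustrative; the paper reports only decay epochs
    # ... one pass over VAAD, BATCH_SIZE clips per iteration ...
    scheduler.step()  # drops the learning rate by 10x at the milestone epoch

The same helper would cover I3D and BAUN by passing a 3D backbone with family="i3d" or family="baun"; only the learning rates, decay epochs, and batch size above are taken from the paper.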