Weakly-Guided Self-Supervised Pretraining for Temporal Activity Detection
Authors: Kumara Kahatapitiya, Zhou Ren, Haoxiang Li, Zhenyu Wu, Michael S. Ryoo, Gang Hua
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that the models pretrained with the proposed weakly-guided self-supervised detection task outperform prior work on multiple challenging activity detection benchmarks, including Charades and MultiTHUMOS. Our extensive ablations further provide insights on when and how to use the proposed models for activity detection. |
| Researcher Affiliation | Collaboration | 1Stony Brook University 2Wormpex AI Research |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (e.g., clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | Yes | Code is available at github.com/kkahatapitiya/SSDet. |
| Open Datasets | Yes | We pretrain on commonly-used Kinetics-400 (Carreira and Zisserman 2017) and evaluate on rather-complex Charades (Sigurdsson et al. 2016) and MultiTHUMOS (Yeung et al. 2018). |
| Dataset Splits | Yes | At inference, we make predictions for 25 equally-sampled frames for each input in the validation set, which is the standard Charades localization evaluation protocol (Sigurdsson et al. 2016) followed by all previous work. (A sketch of this frame sampling appears after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments, beyond general mentions of 'compute requirement'. |
| Software Dependencies | No | The paper mentions software components like 'X3D' and 'Binary Cross-Entropy (BCE)', but does not provide specific version numbers for any software dependencies (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | We pretrain X3D for 100k iterations with a batch size of 64 and an initial learning rate of 0.05, which is reduced by a factor of 10 after 80k iterations. We use a dropout rate of 0.5. From each clip, we sample 16 frames at a stride of 5, following the usual X3D training setup. During training, first, each input is randomly sampled in [256, 320] pixels, spatially cropped to 224×224, and a random horizontal flip is applied. We initialize X3D... train for 100 epochs with a batch size of 16. Initially, we have a learning rate of 0.02, which is decreased by a factor of 10 at 80 epochs. We train all methods on Charades with Binary Cross-Entropy (BCE) as localization and classification losses. (A minimal training-loop sketch based on this setup follows the table.) |
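
The 25-frame Charades evaluation protocol quoted in the Dataset Splits row is straightforward to reproduce. Below is a minimal sketch of one plausible way to compute 25 equally-spaced frame indices per validation video; the exact sampling convention (segment centers here) is an assumption on our part, not taken from the paper or its code.

```python
import numpy as np

def equally_sampled_indices(num_frames: int, num_samples: int = 25) -> np.ndarray:
    """Return `num_samples` equally-spaced frame indices for a video of
    `num_frames` frames, mimicking the standard Charades localization
    protocol of 25 predictions per validation video."""
    # Take the center frame of each of `num_samples` equal temporal segments.
    return np.floor((np.arange(num_samples) + 0.5) * num_frames / num_samples).astype(int)

# Example: a 250-frame video yields indices 5, 15, ..., 245.
print(equally_sampled_indices(250))
```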
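
For the pretraining schedule in the Experiment Setup row, a minimal PyTorch sketch is given below. `build_x3d` and `pretrain_loader` are hypothetical placeholders (see github.com/kkahatapitiya/SSDet for the actual code), and the SGD momentum value is an assumption; only the iteration count, batch size, learning-rate schedule, dropout, and BCE loss come from the quoted setup.

```python
import torch

# `build_x3d` and `pretrain_loader` are hypothetical placeholders, not the
# paper's actual code; see github.com/kkahatapitiya/SSDet for the real setup.
model = build_x3d(dropout=0.5)  # X3D backbone with dropout 0.5 (from the paper)

# Pretraining: 100k iterations, batch size 64, initial LR 0.05 reduced by 10x
# after 80k iterations. Momentum 0.9 is an assumption, not stated in the row.
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80_000], gamma=0.1
)
criterion = torch.nn.BCEWithLogitsLoss()  # BCE for localization/classification

for it, (clips, targets) in enumerate(pretrain_loader):
    # clips: 16 frames sampled at stride 5, randomly resized to [256, 320] px,
    # cropped to 224x224, with a random horizontal flip (per the quoted setup).
    loss = criterion(model(clips), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # LR schedule stepped per iteration, not per epoch
    if it + 1 >= 100_000:
        break
```

The Charades fine-tuning stage quoted in the same row follows the same pattern with a batch size of 16, an initial learning rate of 0.02, and a 10x decay at 80 of 100 epochs.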