Weakly-Guided Self-Supervised Pretraining for Temporal Activity Detection
Authors: Kumara Kahatapitiya, Zhou Ren, Haoxiang Li, Zhenyu Wu, Michael S. Ryoo, Gang Hua
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that the models pretrained with the proposed weakly-guided self-supervised detection task outperform prior work on multiple challenging activity detection benchmarks, including Charades and MultiTHUMOS. Our extensive ablations further provide insights on when and how to use the proposed models for activity detection. |
| Researcher Affiliation | Collaboration | 1Stony Brook University 2Wormpex AI Research |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (e.g., clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | Yes | Code is available at github.com/kkahatapitiya/SSDet. |
| Open Datasets | Yes | We pretrain on commonly-used Kinetics-400 (Carreira and Zisserman 2017) and evaluate on rather-complex Charades (Sigurdsson et al. 2016) and MultiTHUMOS (Yeung et al. 2018). |
| Dataset Splits | Yes | At inference, we make predictions for 25 equally-sampled frames for each input in the validation set, which is the standard Charades localization evaluation protocol (Sigurdsson et al. 2016) followed by all previous work. (A sketch of this frame sampling appears after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments, beyond general mentions of 'compute requirement'. |
| Software Dependencies | No | The paper mentions software components like 'X3D' and 'Binary Cross-Entropy (BCE)', but does not provide specific version numbers for any software dependencies (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | We pretrain X3D for 100k iterations with a batch size of 64 and an initial learning rate of 0.05, which is reduced by a factor of 10 after 80k iterations. We use a dropout rate of 0.5. From each clip, we sample 16 frames at a stride of 5, following the usual X3D training setup. During training, first, each input is randomly sampled in [256, 320] pixels, spatially cropped to 224×224, and a random horizontal flip is applied. We initialize X3D... train for 100 epochs with a batch size of 16. Initially, we have a learning rate of 0.02, which is decreased by a factor of 10 at 80 epochs. We train all methods on Charades with Binary Cross-Entropy (BCE) as localization and classification losses. (A minimal training-loop sketch based on this setup follows the table.) |
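
The 25-frame Charades evaluation protocol quoted in the Dataset Splits row is straightforward to reproduce. Below is a minimal sketch of one plausible way to compute 25 equally-spaced frame indices per validation video; the exact sampling convention (segment centers here) is an assumption on our part, not taken from the paper or its code.

```python
import numpy as np

def equally_sampled_indices(num_frames: int, num_samples: int = 25) -> np.ndarray:
    """Return `num_samples` equally-spaced frame indices for a video of
    `num_frames` frames, mimicking the standard Charades localization
    protocol of 25 predictions per validation video."""
    # Take the center frame of each of `num_samples` equal temporal segments.
    return np.floor((np.arange(num_samples) + 0.5) * num_frames / num_samples).astype(int)

# Example: a 250-frame video yields indices 5, 15, ..., 245.
print(equally_sampled_indices(250))
```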
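
For the pretraining schedule in the Experiment Setup row, a minimal PyTorch sketch is given below. `build_x3d` and `pretrain_loader` are hypothetical placeholders (see github.com/kkahatapitiya/SSDet for the actual code), and the SGD momentum value is an assumption; only the iteration count, batch size, learning-rate schedule, dropout, and BCE loss come from the quoted setup.

```python
import torch

# `build_x3d` and `pretrain_loader` are hypothetical placeholders, not the
# paper's actual code; see github.com/kkahatapitiya/SSDet for the real setup.
model = build_x3d(dropout=0.5)  # X3D backbone with dropout 0.5 (from the paper)

# Pretraining: 100k iterations, batch size 64, initial LR 0.05 reduced by 10x
# after 80k iterations. Momentum 0.9 is an assumption, not stated in the row.
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80_000], gamma=0.1
)
criterion = torch.nn.BCEWithLogitsLoss()  # BCE for localization/classification

for it, (clips, targets) in enumerate(pretrain_loader):
    # clips: 16 frames sampled at stride 5, randomly resized to [256, 320] px,
    # cropped to 224x224, with a random horizontal flip (per the quoted setup).
    loss = criterion(model(clips), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # LR schedule stepped per iteration, not per epoch
    if it + 1 >= 100_000:
        break
```

The Charades fine-tuning stage quoted in the same row follows the same pattern with a batch size of 16, an initial learning rate of 0.02, and a 10x decay at 80 of 100 epochs.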