PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points

Authors: Jing Tan, Xiaotong Zhao, Xintian Shi, Bin Kang, Limin Wang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our proposed method on two popular benchmarks and introduce the new metric of detection-mAP for multi-label TAD. Our model outperforms all previous methods by a large margin under the detection-mAP metric, and also achieves promising results under the segmentation-mAP metric."
Researcher Affiliation | Collaboration | Jing Tan [1], Xiaotong Zhao [2], Xintian Shi [2], Bin Kang [2], Limin Wang [1,3]; [1] State Key Laboratory for Novel Software Technology, Nanjing University; [2] Platform and Content Group (PCG), Tencent; [3] Shanghai AI Lab
Pseudocode | No | The paper contains architectural diagrams and mathematical equations, but no structured pseudocode or algorithm blocks labeled as such.
Open Source Code | Yes | "Code is available at https://github.com/MCG-NJU/PointTAD."
Open Datasets | Yes | "We conduct experiments on two popular multi-label action detection benchmarks: MultiTHUMOS [46] and Charades [33]."
Dataset Splits | No | "The video sequence is pre-processed with sliding window mechanism. To accommodate most of the actions, the window size is set to 256 frames for MultiTHUMOS (99.1% actions included), and 400 frames for Charades (97.3% actions included). The overlap ratio is 0.75 at training, and 0 at inference."
Hardware Specification | Yes | "The network is trained on a server with 8 V100 GPUs."
Software Dependencies | No | "We adopt AdamW as optimizer with 1e-4 weight decay. The network is trained on a server with 8 V100 GPUs. The batch size is 3 per GPU for MultiTHUMOS and 2 per GPU for Charades. The learning rate is set to 2e-4 and drops by half at every 10 epochs."
Experiment Setup | Yes | "The batch size is 3 per GPU for MultiTHUMOS and 2 per GPU for Charades. The learning rate is set to 2e-4 and drops by half at every 10 epochs. Backbone learning rate is additionally multiplied with 0.1 for stable training. Nq is set to 48 for both benchmarks. The number of query points per query Ns is 21. The number of deformable sub-points is set to 4 according to the number of sampling points in TadTR [25]. The optimal gamma is 0.01 for both datasets."
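The sliding-window preprocessing quoted under Dataset Splits (256-frame windows for MultiTHUMOS, 0.75 overlap at training, 0 at inference) can be sketched as below. This is a minimal illustration in our own words; the function and variable names are not taken from the PointTAD codebase.

```python
def sliding_windows(num_frames, window_size, overlap):
    """Generate [start, end) frame windows covering a video.

    `overlap` is the fraction of a window shared with the previous one:
    0.75 at training and 0 at inference, as quoted above.
    """
    stride = max(1, int(window_size * (1 - overlap)))
    windows = []
    start = 0
    while start < num_frames:
        windows.append((start, min(start + window_size, num_frames)))
        if start + window_size >= num_frames:
            break  # last window already reaches the end of the video
        start += stride
    return windows

# MultiTHUMOS-style settings on a hypothetical 1000-frame video:
train_windows = sliding_windows(1000, 256, 0.75)  # stride 64, dense coverage
test_windows = sliding_windows(1000, 256, 0.0)    # stride 256, no overlap
```

With 0.75 overlap the stride is 256 * 0.25 = 64 frames, so consecutive training windows share three quarters of their content, while inference windows tile the video back to back.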
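The learning-rate schedule quoted above (base rate 2e-4, halved every 10 epochs, backbone rate additionally multiplied by 0.1) can be written as a closed-form step schedule. This is a sketch of the described schedule only, not code from the paper; the helper name is our own.

```python
def lr_at_epoch(base_lr, epoch, drop_every=10, factor=0.5):
    """Step learning-rate schedule: multiply by `factor` every `drop_every` epochs."""
    return base_lr * factor ** (epoch // drop_every)

# Head parameters start at 2e-4; the backbone uses a 0.1x multiplier,
# as described in the experiment setup.
head_lr = lr_at_epoch(2e-4, 15)            # after one drop: 1e-4
backbone_lr = 0.1 * lr_at_epoch(2e-4, 15)  # 1e-5
```

In a PyTorch training loop this corresponds to two parameter groups in AdamW (weight decay 1e-4) driven by a `StepLR(step_size=10, gamma=0.5)` scheduler.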