Part Aware Contrastive Learning for Self-Supervised Action Recognition

Authors: Yilei Hua, Wenhan Wu, Ce Zheng, Aidong Lu, Mengyuan Liu, Chen Chen, Shiqian Wu

IJCAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiment results demonstrate that the inclusion of local feature similarity significantly enhances skeleton-based action representation. Our proposed SkeAttnCLR outperforms state-of-the-art methods on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets.
Researcher Affiliation | Academia | (1) School of Information Science and Engineering, Wuhan University of Science and Technology; (2) University of North Carolina at Charlotte; (3) Center for Research in Computer Vision, University of Central Florida; (4) Peking University, Shenzhen Graduate School
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and settings are available at this repository: https://github.com/GitHubOfHyl97/SkeAttnCLR
Open Datasets | Yes | NTU-RGB+D 60 (NTU-60). NTU-60 [Shahroudy et al., 2016] is a large-scale skeleton dataset for human skeleton-based action recognition, containing 56,578 videos with 60 actions and 25 joints for each human body. NTU-RGB+D 120 (NTU-120). NTU-120 [Liu et al., 2019] is an expansion dataset of NTU-60, containing 113,945 sequences with 120 action labels. PKU Multi-Modality Dataset (PKU-MMD). PKU-MMD [Liu et al., 2020] is a substantial dataset that encompasses a multi-modal 3D comprehension of human actions, containing around 20,000 instances and 51 distinct action labels.
Dataset Splits | Yes | The NTU-60 dataset includes two evaluation protocols: the Cross-Subject (X-Sub) protocol, which divides data by subject with half of the subjects used for training and half for testing, and the Cross-View (X-View) protocol, which divides data by camera view: samples captured by cameras 2 and 3 are used for training, and samples from camera 1 are used for testing. NTU-120 also offers two evaluation protocols, Cross-Subject (X-Sub) and Cross-Set (X-Set). In X-Sub, 53 subjects are used for training and the other 53 for testing, while in X-Set, half of the setups (even setup IDs) are used for training and the remaining setups (odd setup IDs) for testing. During the contrastive learning training process, we run a KNN evaluation every 10 epochs on the feature embeddings extracted by the encoder and evaluate the accuracy on the test set. Finally, the model with the highest KNN result is selected to participate in the other experiments (a sketch of this evaluation follows the table).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions 'PyTorch [Paszke et al., 2019]' but does not provide specific version numbers for PyTorch or any other software dependencies needed to replicate the experiment.
Experiment Setup | Yes | Our experiments mainly use the SGD optimizer [Ruder, 2016] to optimize the model. For all contrastive learning training, we use a learning rate of 0.1, a momentum of 0.9, and a weight decay of 0.0001 for a total of 300 epochs, and decay the base learning rate to one-tenth of its original value at the 250th epoch. In addition, our data processing employs human skeleton action sequences with a length of 64 frames, and the batch size is 128. We choose ST-GCN [Yan et al., 2018] as the main backbone of our experiments for a fair comparison... (a minimal sketch of this optimization schedule follows the table)
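
The KNN-based checkpoint selection quoted under Dataset Splits can be approximated as follows. This is a minimal sketch, not the authors' code: the `encoder`, the data loaders, the cosine metric, and the value of `k` are all assumptions, since the paper does not specify them.

```python
# Hypothetical sketch of KNN-based checkpoint selection: embed both splits
# with the frozen encoder, fit a KNN classifier on the training embeddings,
# and score it on the test embeddings. Names, metric, and k are illustrative.
import numpy as np
import torch
from sklearn.neighbors import KNeighborsClassifier

@torch.no_grad()
def extract_features(encoder, loader, device="cuda"):
    encoder.eval()
    feats, labels = [], []
    for x, y in loader:
        z = encoder(x.to(device))          # (B, D) embedding per skeleton sequence
        feats.append(z.cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def knn_accuracy(encoder, train_loader, test_loader, k=1):
    train_z, train_y = extract_features(encoder, train_loader)
    test_z, test_y = extract_features(encoder, test_loader)
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_z, train_y)
    return knn.score(test_z, test_y)       # accuracy on the test split

# Run every 10 epochs during contrastive pre-training and keep the
# checkpoint with the highest accuracy for the downstream experiments.
```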
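The optimization recipe quoted under Experiment Setup maps directly onto PyTorch's standard `SGD` and `MultiStepLR`. Below is a minimal sketch of the reported schedule; the linear module is a placeholder standing in for the ST-GCN backbone, and the commented-out training call is hypothetical.

```python
# Sketch of the reported schedule: SGD (lr 0.1, momentum 0.9, weight decay
# 1e-4), 300 epochs, with the learning rate cut to one-tenth at epoch 250.
import torch
import torch.nn as nn

model = nn.Linear(64 * 25 * 3, 128)  # placeholder; the paper uses ST-GCN
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[250], gamma=0.1)

for epoch in range(300):
    # train_one_epoch(model, loader, optimizer)  # batch size 128, 64-frame clips
    scheduler.step()  # lr becomes 0.01 from epoch 250 onward
```

A single milestone at 250 with `gamma=0.1` reproduces the one-time decay to one-tenth of the base learning rate described in the paper.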