PACE: Predictive and Contrastive Embedding for Unsupervised Action Segmentation
Authors: Jiahao Wang, Jie Qin, Yunhong Wang, Annan Li
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three challenging benchmarks demonstrate the superiority of our method, with up to 26.9% improvements in F1 score over the state of the art. |
| Researcher Affiliation | Academia | (1) State Key Laboratory of Virtual Reality Technology and System, Beihang University; (2) College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide explicit access information or links to source code for the described methodology. |
| Open Datasets | Yes | We evaluate the performance of PACE on three UAS benchmarks, namely Breakfast [Kuehne et al., 2014], 50Salads [Stein and McKenna, 2013] and YouTube Instructions (YTI) [Alayrac et al., 2016]. |
| Dataset Splits | No | The paper uses well-known benchmark datasets (Breakfast, 50Salads, YTI) but does not explicitly provide details on training, validation, and test splits (e.g., percentages, sample counts, or specific predefined split names/citations). |
| Hardware Specification | Yes | All experiments are conducted with two NVIDIA RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions software like TensorFlow and SciPy but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | The encoder has 3 basic layers in total. There are 4 attention heads in SA, and the inner-layer of FFN is 2,048-d. We set the dimensionalities of both hidden representations (h) and contrastive embeddings (m) to 512. The encoder takes as input video sequences of 100 frames (i.e. n = 100) and the sequence is then divided into clips with length s = 5. We set the training batch size to 32. To increase the contrastive power in Lctrst, we expand negative samples with Cj from different sequences in the same batch. We empirically set α to 0.1 in order to maintain comparable scales of the two losses. We utilize the Adam [Kingma and Ba, 2015] optimizer with a learning rate of 0.0001. The total training epochs are 50, 30, 10 on Breakfast, 50Salads and YTI, respectively, according to their scales. |
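For reimplementation purposes, the hyperparameters quoted in the row above can be collected into a single configuration object. The sketch below is a minimal, stdlib-only summary, not the authors' code; all field names (`num_layers`, `ffn_dim`, `clips_per_seq`, etc.) are our own labels for the values reported in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaceConfig:
    """Hyperparameters reported in the PACE paper (field names are ours)."""
    num_layers: int = 3    # encoder "basic layers"
    num_heads: int = 4     # attention heads in self-attention (SA)
    ffn_dim: int = 2048    # inner-layer width of the FFN
    hidden_dim: int = 512  # dimensionality of hidden representations h
    embed_dim: int = 512   # dimensionality of contrastive embeddings m
    seq_len: int = 100     # n, frames per input video sequence
    clip_len: int = 5      # s, frames per clip
    batch_size: int = 32
    alpha: float = 0.1     # loss-balancing weight for L_ctrst
    lr: float = 1e-4       # Adam learning rate

    @property
    def clips_per_seq(self) -> int:
        # Each 100-frame sequence is divided into clips of length s.
        return self.seq_len // self.clip_len


cfg = PaceConfig()
print(cfg.clips_per_seq)  # 20 clips of 5 frames per sequence
```

Note that the training-epoch count is dataset-dependent (50 for Breakfast, 30 for 50Salads, 10 for YTI), so it is deliberately left out of the shared config.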