Efficient Temporal Action Segmentation via Boundary-aware Query Voting

Authors: Peiyao Wang, Yuewei Lin, Erik Blasch, Jie Wei, Haibin Ling

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experiments conducted on three popular TAS datasets validate the effectiveness of our method. |
| Researcher Affiliation | Collaboration | ¹Stony Brook University, ²Brookhaven National Laboratory, ³Air Force Research Laboratory, ⁴The City College of New York |
| Pseudocode | Yes | Algorithm 1: Boundary-aware Query Voting (a speculative sketch follows this table) |
| Open Source Code | Yes | The code for this project is publicly available at https://github.com/peiyao-w/BaFormer. |
| Open Datasets | Yes | We use three challenging datasets: GTEA [16], 50Salads [39], and Breakfast [25]. |
| Dataset Splits | Yes | We conduct 4-fold cross-validation for GTEA and Breakfast and 5-fold cross-validation for 50Salads, consistent with previous research [21, 30, 43, 45, 15, 3, 42]. (A split-loop sketch follows this table.) |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA RTX 3090. |
| Software Dependencies | No | The paper mentions software components and models such as I3D, the ASFormer encoder, and the Adam optimizer, but does not give version numbers for them or for other key dependencies such as the programming language or the deep learning framework. |
| Experiment Setup | Yes | For the frame encoder, we use the pre-trained I3D [5] with fixed parameters to obtain frame-wise features with a dimension of 2048. The frame decoder allows any architecture designed for dense prediction tasks; in our paper, we utilize the ASFormer encoder [45], replicating its configuration with 11 layers and an output dimension C of 64. The Transformer decoder starts with 90 queries for GTEA and 100 queries for the 50Salads and Breakfast datasets, followed by 10 decoder layers, each with three attention heads and a hidden dimension of 128. In the output heads, for mask and boundary prediction, we employ MLP layers with a hidden dimension of 64. For Hungarian matching, λfocal = 5.0 and λdice = 1.0. During training, we adopt the Adam optimizer [23] with the step learning-rate schedule in [19]; initial learning rates are 5 × 10⁻⁴ for the GTEA and 50Salads datasets and 1 × 10⁻³ for the Breakfast dataset, with a decay factor of 0.5. Training runs for 300 epochs with a batch size of 1. (Hedged sketches of the matching cost and training loop follow this table.) |