Efficient Temporal Action Segmentation via Boundary-aware Query Voting
Authors: Peiyao Wang, Yuewei Lin, Erik Blasch, Jie Wei, Haibin Ling
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted on three popular TAS datasets validate the effectiveness of our method. |
| Researcher Affiliation | Collaboration | 1Stony Brook University, 2Brookhaven National Laboratory, 3Air Force Research Laboratory, 4The City College of New York |
| Pseudocode | Yes | Algorithm 1: Boundary-aware Query Voting |
| Open Source Code | Yes | The code for this project is publicly available at https://github.com/peiyao-w/BaFormer. |
| Open Datasets | Yes | We use three challenging datasets, GTEA [16], 50Salads [39] and Breakfast [25] |
| Dataset Splits | Yes | We conduct 4-fold cross-validation for GTEA and Breakfast and 5-fold cross-validation for 50Salads, consistent with previous research [21, 30, 43, 45, 15, 3, 42]. |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA RTX 3090. |
| Software Dependencies | No | The paper mentions software components such as I3D, the ASFormer encoder, and the Adam optimizer, but does not specify version numbers for them or for other key software dependencies such as the programming language or deep learning framework. |
| Experiment Setup | Yes | For the frame encoder, we use the pre-trained I3D [5] with fixed parameters to obtain frame-wise features with a dimension of 2048. The frame decoder allows any architecture designed for dense prediction tasks; in our paper, we utilize the ASFormer encoder [45], replicating its configuration with 11 layers and an output dimension C of 64. For the Transformer decoder, initialization starts with 90 queries for GTEA and 100 queries for the 50Salads and Breakfast datasets, subsequently employing 10-layer Transformer decoders. Each decoder layer comprises three attention heads and a hidden dimension of 128. In the output heads, for mask and boundary prediction, we employ MLP layers with a hidden dimension of 64. For Hungarian matching, λfocal = 5.0 and λdice = 1.0. During training, we adopt the Adam optimizer [23] with the step learning rate schedule in [19]. Initial learning rates are set to 5 × 10^−4 for the GTEA and 50Salads datasets and 1 × 10^−3 for the Breakfast dataset, incorporating a decay factor of 0.5. Training runs for 300 epochs with a batch size of 1. |
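The quoted setup can be collected into a single configuration sketch. This is a minimal illustration only: the variable names are invented here, and the decay interval `step_size` of the step learning-rate schedule is an assumption, since the paper only cites the schedule of [19] and a decay factor of 0.5.

```python
# Hypothetical summary of the reported training configuration.
# Values are taken from the quoted Experiment Setup; names and the
# 100-epoch decay interval are assumptions for illustration.
DATASET_CONFIG = {
    "GTEA":      {"num_queries": 90,  "initial_lr": 5e-4},
    "50Salads":  {"num_queries": 100, "initial_lr": 5e-4},
    "Breakfast": {"num_queries": 100, "initial_lr": 1e-3},
}

COMMON_CONFIG = {
    "frame_feature_dim": 2048,  # pre-trained I3D features, parameters fixed
    "encoder_layers": 11,       # ASFormer encoder
    "encoder_dim": 64,          # output dimension C
    "decoder_layers": 10,       # Transformer decoder
    "attention_heads": 3,
    "hidden_dim": 128,
    "mlp_hidden_dim": 64,       # mask / boundary prediction heads
    "lambda_focal": 5.0,        # Hungarian matching weights
    "lambda_dice": 1.0,
    "epochs": 300,
    "batch_size": 1,
    "lr_decay": 0.5,
}

def step_lr(initial_lr: float, epoch: int,
            step_size: int = 100, decay: float = 0.5) -> float:
    """Step learning-rate schedule: multiply by `decay` every `step_size`
    epochs. The 100-epoch interval is an assumption; the paper follows
    the schedule of its reference [19]."""
    return initial_lr * decay ** (epoch // step_size)
```

For example, with the Breakfast settings the learning rate would start at 1 × 10^−3 and halve at each decay step under this assumed interval.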