Entity-aware and Motion-aware Transformers for Language-driven Action Localization

Authors: Shuo Yang, Xinxiao Wu

IJCAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the Charades-STA and TACoS datasets demonstrate that our method achieves better performance than existing methods.
Researcher Affiliation | Academia | Shuo Yang, Xinxiao Wu, Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, {shuoyang,wuxinxiao}@bit.edu.cn
Pseudocode | No | The paper includes architectural diagrams (Figure 3) but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/shuoyang129/eamat
Open Datasets | Yes | We evaluate our method on two datasets: Charades-STA [Gao et al., 2017] and TACoS [Regneri et al., 2013].
Dataset Splits | Yes | The Charades-STA dataset is built on the Charades dataset [Sigurdsson et al., 2016] and contains 16,128 annotations, including 12,408 for training and 3,720 for testing. The TACoS dataset is built on the MPII Cooking Composite dataset [Rohrbach et al., 2012] and contains 18,818 annotations, including 10,146 for training, 4,589 for validation, and 4,083 for testing.
Hardware Specification | No | The paper mentions using pre-extracted 3D convolutional features (C3D, I3D) but does not specify any hardware details such as GPU/CPU models or memory used for training or inference.
Software Dependencies | No | The paper mentions using Adam for optimization but does not provide specific version numbers for any software libraries or dependencies.
Experiment Setup | Yes | We adopt Adam [Kingma and Ba, 2014] for optimization with an initial learning rate of 5e-4 and a linear decay schedule. The loss weights λ1 and λ2 in Equation (14) are set to 1 and 10, respectively. The number of Transformer blocks is set to 1 for the early Transformers and 3 for the late Transformers in the entity-aware and motion-aware Transformers. The feature dimension of all intermediate layers is set to 512, the head number of multi-head self-attention is set to 8, and the layer number and scale number of the long short-term memory are set to 1 and 3, respectively.
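
For quick reference, the annotation counts quoted in the Dataset Splits row can be collected into a small summary. The dictionary layout and the name DATASET_SPLITS below are illustrative assumptions, not part of the released code.

# Annotation counts reported for the two benchmarks (layout is illustrative only).
DATASET_SPLITS = {
    "Charades-STA": {"train": 12408, "test": 3720, "total": 16128},
    "TACoS": {"train": 10146, "val": 4589, "test": 4083, "total": 18818},
}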
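
The sketch below collects the hyperparameters stated in the Experiment Setup row into a plain configuration and sets up Adam with a linear learning-rate decay in PyTorch. It is a minimal sketch, not the authors' implementation: the names EAMAT_CONFIG, build_optimizer, and total_steps are assumptions, the loss terms of Equation (14) are not reproduced, and decaying the learning rate to zero is an assumed endpoint since the paper only states "a linear decay schedule".

# Sketch of the reported training configuration (assumed names, not the authors' code).
import torch

# Hyperparameters taken from the Experiment Setup row above.
EAMAT_CONFIG = {
    "learning_rate": 5e-4,           # initial Adam learning rate
    "lambda1": 1.0,                  # loss weight lambda_1 in Equation (14)
    "lambda2": 10.0,                 # loss weight lambda_2 in Equation (14)
    "early_transformer_blocks": 1,   # early Transformer blocks (entity-/motion-aware)
    "late_transformer_blocks": 3,    # late Transformer blocks (entity-/motion-aware)
    "hidden_dim": 512,               # feature dimension of all intermediate layers
    "num_heads": 8,                  # heads in multi-head self-attention
    "lstm_layers": 1,                # layer number of the long short-term memory
    "lstm_scales": 3,                # scale number of the long short-term memory
}

def build_optimizer(model: torch.nn.Module, total_steps: int):
    # Adam with the reported initial learning rate of 5e-4.
    optimizer = torch.optim.Adam(model.parameters(), lr=EAMAT_CONFIG["learning_rate"])
    # Linear decay of the learning rate over total_steps (decay-to-zero is an assumption).
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=lambda step: max(0.0, 1.0 - step / float(total_steps)),
    )
    return optimizer, scheduler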