Entity-aware and Motion-aware Transformers for Language-driven Action Localization
Authors: Shuo Yang, Xinxiao Wu
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the Charades-STA and TACoS datasets demonstrate that our method achieves better performance than existing methods. |
| Researcher Affiliation | Academia | Shuo Yang, Xinxiao Wu, Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, {shuoyang,wuxinxiao}@bit.edu.cn |
| Pseudocode | No | The paper includes architectural diagrams (Figure 3) but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/shuoyang129/eamat |
| Open Datasets | Yes | We evaluate our method on two datasets: Charades-STA [Gao et al., 2017] and TACoS [Regneri et al., 2013]. |
| Dataset Splits | Yes | The Charades-STA dataset is built on the Charades dataset [Sigurdsson et al., 2016] and contains 16,128 annotations, including 12,408 for training and 3,720 for test. The TACoS dataset is built on the MPII Cooking Compositive dataset [Rohrbach et al., 2012] and contains 18,818 annotations, including 10,146 for training, 4,589 for validation, and 4,083 for test. |
| Hardware Specification | No | The paper mentions using pre-extracted 3D convolutional features (C3D, I3D) but does not specify any hardware details like GPU/CPU models or memory used for training or inference. |
| Software Dependencies | No | The paper mentions using Adam for optimization but does not provide specific version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | We adopt Adam [Kingma and Ba, 2014] for optimization with an initial learning rate of 5e-4 and a linear decay schedule. The loss weights λ1 and λ2 in Equation (14) are set to 1 and 10, respectively. The number of Transformer blocks is set to 1 and 3 for the early and late Transformers in the entity-aware and motion-aware Transformers, respectively. The feature dimension of all intermediate layers is set to 512, the head number of multi-head self-attention is set to 8, and the layer number and scale number of the long short-term memory are set to 1 and 3, respectively. |
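
The split sizes quoted in the Dataset Splits row can be checked against the reported annotation totals with a small arithmetic sanity check. The dictionary layout below is illustrative only; the numbers come from the table above.

```python
# Sanity check: reported split sizes should sum to the annotation totals.
charades_sta = {"train": 12_408, "test": 3_720}
tacos = {"train": 10_146, "val": 4_589, "test": 4_083}

assert sum(charades_sta.values()) == 16_128  # Charades-STA annotations
assert sum(tacos.values()) == 18_818         # TACoS annotations
```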
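
For the Experiment Setup row, the following is a minimal sketch of the reported optimization settings, assuming PyTorch (the paper does not name the framework). The model, the epoch count, and the loss-term names are placeholders; the actual entity-aware and motion-aware Transformers are in the authors' repository linked above.

```python
# Hedged sketch of the reported hyperparameters; not the authors' code.
import torch
from torch import nn, optim

D_MODEL = 512                      # feature dimension of all intermediate layers
N_HEADS = 8                        # heads in multi-head self-attention
EARLY_BLOCKS, LATE_BLOCKS = 1, 3   # Transformer blocks (early / late)
LSTM_LAYERS, LSTM_SCALES = 1, 3    # long short-term memory layers / scales
LAMBDA_1, LAMBDA_2 = 1.0, 10.0     # loss weights in Equation (14)

model = nn.Linear(D_MODEL, D_MODEL)  # placeholder for the full model

# Adam with an initial learning rate of 5e-4 and a linear decay schedule.
optimizer = optim.Adam(model.parameters(), lr=5e-4)
NUM_EPOCHS = 100  # assumed; the paper does not report the training length
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: max(0.0, 1.0 - epoch / NUM_EPOCHS)
)

def total_loss(l_base, l_1, l_2):
    """Weighted sum showing where λ1 and λ2 enter; the exact composition of
    Equation (14) is not reproduced here and the term names are placeholders."""
    return l_base + LAMBDA_1 * l_1 + LAMBDA_2 * l_2
```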