Relational Self-Attention: What's Missing in Attention for Video Understanding
Authors: Manjin Kim, Heeseung Kwon, Chunyu Wang, Suha Kwak, Minsu Cho
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on the standard motion-centric benchmarks for video action recognition, such as Something-Something-V1&V2, Diving48, and FineGym. |
| Researcher Affiliation | Collaboration | Manjin Kim¹, Heeseung Kwon¹, Chunyu Wang², Suha Kwak¹, Minsu Cho¹ (¹POSTECH, ²Microsoft Research Asia) |
| Pseudocode | Yes | Please refer to the pseudocode (Fig. 1) in our Supp. for more details. |
| Open Source Code | No | The paper provides a project webpage URL (http://cvlab.postech.ac.kr/research/RSA/) on the first page, but it does not explicitly state that the source code for the described methodology is available at that link or in the supplementary materials. |
| Open Datasets | Yes | Something-Something v1 & v2 (SS-V1 & V2) [14] are both large-scale action recognition benchmarks, including 108k and 220k action clips, respectively. [...] Diving-48 [30] is a fine-grained action benchmark [...] FineGym [39] is a motion-centric benchmark... |
| Dataset Splits | Yes | Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on the standard motion-centric benchmarks for video action recognition, such as Something-Something-V1&V2, Diving48, and FineGym. |
| Hardware Specification | No | The paper mentions an 'out-of-memory error' during an ablation study but does not provide any specific details about the hardware (e.g., GPU model, CPU, RAM) used for experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers, such as Python or deep learning framework versions. |
| Experiment Setup | Yes | We use TSN-ResNet50 [54] as our backbone and replace the standard spatial convolution layers by spatio-temporal RSA layers for every two ResNet bottlenecks [17]. Unless specified otherwise, there are 7 RSA layers in total, where L = 8, D = C_Q, and M = 5×7×7. We set the input and output channel dimensions of RSA layers to be equal to those of the spatial convolution layers in TSN-ResNet50. For initialization, we randomly initialize the weights of bottlenecks including RSA layers with the MSRA method [16] and use ImageNet pre-trained weights for all the other layers. We set the gamma parameter of the last batch normalization layer to zero. |
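
Based on the quoted setup, a minimal PyTorch/torchvision sketch of the described backbone surgery might look like the following. `RSALayer`/`rsa_layer_cls` is a hypothetical stand-in for the paper's spatio-temporal RSA module (its internals are given only as pseudocode in the authors' supplementary Fig. 1 and are not reproduced here); treating `conv2` as the 3×3 spatial convolution of each torchvision bottleneck and replacing it in every second bottleneck are assumptions read off the quoted description, not the authors' code.

```python
# A minimal sketch, assuming PyTorch + torchvision, of the replacement and
# initialization scheme quoted above. `rsa_layer_cls` is a hypothetical
# stand-in for the paper's spatio-temporal RSA layer (not shown here).
import torch.nn as nn
from torchvision.models import resnet50

def build_rsa_resnet50(rsa_layer_cls, num_rsa_layers=7):
    """Swap the 3x3 spatial conv of every second ResNet-50 bottleneck for an
    RSA layer with matching input/output channels, then re-initialize."""
    net = resnet50(weights="IMAGENET1K_V1")  # ImageNet weights for untouched layers
    bottlenecks = [b for stage in (net.layer1, net.layer2, net.layer3, net.layer4)
                   for b in stage]
    replaced = 0
    for i, block in enumerate(bottlenecks):
        # "for every two ResNet bottlenecks", capped at 7 RSA layers in total
        if i % 2 == 1 and replaced < num_rsa_layers:
            conv = block.conv2  # the standard 3x3 spatial convolution
            block.conv2 = rsa_layer_cls(conv.in_channels, conv.out_channels)
            # MSRA (He) initialization for the weights of the modified
            # bottleneck, including the new RSA layer.
            for m in block.modules():
                if isinstance(m, (nn.Conv2d, nn.Conv3d, nn.Linear)):
                    nn.init.kaiming_normal_(m.weight, mode="fan_out",
                                            nonlinearity="relu")
            # Zero-init the gamma of the bottleneck's last batch norm,
            # as stated in the quoted setup.
            nn.init.zeros_(block.bn3.weight)
            replaced += 1
    return net
```

Given a concrete `RSALayer(in_channels, out_channels)` implementation, `build_rsa_resnet50(RSALayer)` would return the modified per-frame backbone; the TSN-style frame sampling and the RSA hyper-parameters (L = 8, D = C_Q, M = 5×7×7) would live in the training pipeline and the RSA module itself, respectively.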