Relational Self-Attention: What's Missing in Attention for Video Understanding

Authors: Manjin Kim, Heeseung Kwon, Chunyu Wang, Suha Kwak, Minsu Cho

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on the standard motion-centric benchmarks for video action recognition, such as Something-Something-V1&V2, Diving48, and FineGym.
Researcher Affiliation | Collaboration | Manjin Kim (POSTECH), Heeseung Kwon (POSTECH), Chunyu Wang (Microsoft Research Asia), Suha Kwak (POSTECH), Minsu Cho (POSTECH)
Pseudocode | Yes | Please refer to the pseudocode in Fig. 1 of our Supp. for more details.
Open Source Code | No | The paper provides a project webpage URL (http://cvlab.postech.ac.kr/research/RSA/) on the first page but does not explicitly state that the source code for the described methodology is available at this link or in the supplementary materials.
Open Datasets | Yes | Something-Something v1 & v2 (SS-V1 & V2) [14] are both large-scale action recognition benchmarks, including 108k and 220k action clips, respectively. [...] Diving-48 [30] is a fine-grained action benchmark [...] FineGym [39] is a motion-centric benchmark...
Dataset Splits | Yes | Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on the standard motion-centric benchmarks for video action recognition, such as Something-Something-V1&V2, Diving48, and FineGym.
Hardware Specification | No | The paper mentions an 'out-of-memory error' during an ablation study but does not provide any specific details about the hardware (e.g., GPU model, CPU, RAM) used for experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers, such as Python or deep learning framework versions.
Experiment Setup | Yes | We use TSN-ResNet50 [54] as our backbone and replace the standard spatial convolution layers by spatio-temporal RSA layers for every two ResNet bottlenecks [17]. Unless specified otherwise, there are 7 RSA layers in total, where L = 8, D = C_Q, and M = 5×7×7. We set the input and output channel dimensions of RSA layers to be equal to those of the spatial convolution layers in TSN-ResNet50. For initialization, we randomly initialize the weights of bottlenecks including RSA layers with the MSRA method [16] and use ImageNet pre-trained weights for all the other layers. We set the gamma parameter of the last batch normalization layer to zero.
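
The sketch below is a rough illustration of this reported setup, not the authors' code. It assumes PyTorch and torchvision, uses a hypothetical placeholder class `RSALayer` in place of the paper's actual relational self-attention operator, and interprets "every two ResNet bottlenecks" as replacing the 3x3 spatial convolution of every second bottleneck (which happens to yield 7 replaced layers in ResNet-50) and "the last batch normalization layer" as the final BN of each modified bottleneck; the exact placement in the paper may differ.

```python
# Minimal sketch of the quoted setup, NOT the authors' implementation.
# Assumptions: PyTorch + torchvision; `RSALayer` is a hypothetical stand-in
# whose internals (the actual relational self-attention) are not defined here.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights


class RSALayer(nn.Module):
    """Placeholder spatio-temporal RSA layer (stand-in, not the real operator).

    Keeps the input/output channel dimensions equal to those of the spatial
    convolution it replaces, as stated in the setup.
    """

    def __init__(self, channels: int, clip_len: int = 8):
        super().__init__()
        self.clip_len = clip_len                                   # L = 8 (unused by this stub)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)   # shape-preserving stub

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


def build_rsa_resnet50(clip_len: int = 8) -> nn.Module:
    """Replace the 3x3 spatial conv of every second bottleneck with an RSA layer."""
    # ImageNet pre-trained weights for all layers that are not re-initialized.
    net = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
    for stage in (net.layer1, net.layer2, net.layer3, net.layer4):
        for i, block in enumerate(stage):
            if i % 2 == 1:  # one replacement per two bottlenecks -> 7 RSA layers in ResNet-50
                channels = block.conv2.out_channels
                block.conv2 = RSALayer(channels, clip_len=clip_len)
                # MSRA (He) initialization for the whole modified bottleneck.
                for m in block.modules():
                    if isinstance(m, nn.Conv2d):
                        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
                # Zero-init the gamma of the bottleneck's last batch-norm layer
                # (one reading of "the last batch normalization layer").
                nn.init.zeros_(block.bn3.weight)
    return net


if __name__ == "__main__":
    model = build_rsa_resnet50()
    clip = torch.randn(8, 3, 224, 224)    # L = 8 frames, processed TSN-style frame by frame
    print(model(clip).shape)               # torch.Size([8, 1000])
```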