Relational Self-Attention: What's Missing in Attention for Video Understanding

Authors: Manjin Kim, Heeseung Kwon, Chunyu Wang, Suha Kwak, Minsu Cho

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on the standard motion-centric benchmarks for video action recognition, such as Something-Something-V1&V2, Diving48, and FineGym.
Researcher Affiliation | Collaboration | Manjin Kim (POSTECH), Heeseung Kwon (POSTECH), Chunyu Wang (Microsoft Research Asia), Suha Kwak (POSTECH), Minsu Cho (POSTECH)
Pseudocode | Yes | Please refer to the pseudocode in Fig. 1 of our Supp. for more details.
Open Source Code | No | The paper provides a project webpage URL (http://cvlab.postech.ac.kr/research/RSA/) on the first page but does not explicitly state that the source code for the described methodology is available at this link or in the supplementary materials.
Open Datasets | Yes | Something-Something v1 & v2 (SS-V1 & V2) [14] are both large-scale action recognition benchmarks, including 108k and 220k action clips, respectively. [...] Diving-48 [30] is a fine-grained action benchmark [...] FineGym [39] is a motion-centric benchmark...
Dataset Splits | Yes | Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on the standard motion-centric benchmarks for video action recognition, such as Something-Something-V1&V2, Diving48, and FineGym.
Hardware Specification | No | The paper mentions an 'out-of-memory error' during an ablation study but does not provide any specific details about the hardware (e.g., GPU model, CPU, RAM) used for experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers, such as Python or deep learning framework versions.
Experiment Setup | Yes | We use TSN-ResNet50 [54] as our backbone and replace the standard spatial convolution layers by spatio-temporal RSA layers for every two ResNet bottlenecks [17]. Unless specified otherwise, there are 7 RSA layers in total, where L = 8, D = C_Q, and M = 5×7×7. We set the input and output channel dimensions of RSA layers to be equal to those of the spatial convolution layers in TSN-ResNet50. For initialization, we randomly initialize the weights of bottlenecks including RSA layers with the MSRA method [16] and use ImageNet pre-trained weights for all the other layers. We set the gamma parameter of the last batch normalization layer to zero.
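
The sketch below is a rough illustration of this reported setup, not the authors' code. It assumes PyTorch and torchvision, uses a hypothetical placeholder class `RSALayer` in place of the paper's actual relational self-attention operator, and interprets "every two ResNet bottlenecks" as replacing the 3x3 spatial convolution of every second bottleneck (which happens to yield 7 replaced layers in ResNet-50) and "the last batch normalization layer" as the final BN of each modified bottleneck; the exact placement in the paper may differ.

```python
# Minimal sketch of the quoted setup, NOT the authors' implementation.
# Assumptions: PyTorch + torchvision; `RSALayer` is a hypothetical stand-in
# whose internals (the actual relational self-attention) are not defined here.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights


class RSALayer(nn.Module):
    """Placeholder spatio-temporal RSA layer (stand-in, not the real operator).

    Keeps the input/output channel dimensions equal to those of the spatial
    convolution it replaces, as stated in the setup.
    """

    def __init__(self, channels: int, clip_len: int = 8):
        super().__init__()
        self.clip_len = clip_len                                   # L = 8 (unused by this stub)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)   # shape-preserving stub

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


def build_rsa_resnet50(clip_len: int = 8) -> nn.Module:
    """Replace the 3x3 spatial conv of every second bottleneck with an RSA layer."""
    # ImageNet pre-trained weights for all layers that are not re-initialized.
    net = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
    for stage in (net.layer1, net.layer2, net.layer3, net.layer4):
        for i, block in enumerate(stage):
            if i % 2 == 1:  # one replacement per two bottlenecks -> 7 RSA layers in ResNet-50
                channels = block.conv2.out_channels
                block.conv2 = RSALayer(channels, clip_len=clip_len)
                # MSRA (He) initialization for the whole modified bottleneck.
                for m in block.modules():
                    if isinstance(m, nn.Conv2d):
                        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
                # Zero-init the gamma of the bottleneck's last batch-norm layer
                # (one reading of "the last batch normalization layer").
                nn.init.zeros_(block.bn3.weight)
    return net


if __name__ == "__main__":
    model = build_rsa_resnet50()
    clip = torch.randn(8, 3, 224, 224)    # L = 8 frames, processed TSN-style frame by frame
    print(model(clip).shape)               # torch.Size([8, 1000])
```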