Is Attention Better Than Matrix Decomposition?

Authors: Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, Zhouchen Lin

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments are conducted in the vision tasks where it is crucial to learn the global context, including semantic segmentation and image generation, demonstrating significant improvements over self-attention and its variants. |
| Researcher Affiliation | Collaboration | Zhengyang Geng (1,2), Meng-Hao Guo (3), Hongxu Chen (4), Xia Li (2), Ke Wei (4), Zhouchen Lin (2,5). Affiliations: (1) Zhejiang Lab; (2) Key Lab. of Machine Perception (MoE), School of EECS, Peking University; (3) Tsinghua University; (4) School of Data Science, Fudan University; (5) Pazhou Lab. |
| Pseudocode | Yes | Algorithm 1 (Ham: Soft VQ), Algorithm 2 (Ham: NMF with MU), Algorithm 3 (Ham: Soft CD); see the NMF-with-MU sketch below the table. |
| Open Source Code | Yes | Code is available. |
| Open Datasets | Yes | We choose to conduct all ablation experiments on the PASCAL VOC dataset (Everingham et al., 2010) for semantic segmentation... We benchmark Hamburger on the PASCAL VOC dataset (Everingham et al., 2010), and the PASCAL Context dataset (Mottaghi et al., 2014)... It is convincing to benchmark MD-based Hamburger in the challenging conditional image generation task on ImageNet (Deng et al., 2009). |
| Dataset Splits | Yes | For segmentation, it contains 10,582 images for training, 1,449 images for validation and 1,456 images for testing. |
| Hardware Specification | Yes | We report mIoU of 5 runs on the validation set in the form of best (mean). ResNet-50 (He et al., 2016) with output stride 16 is the backbone for all ablation experiments... less than 12 hours using 1 NVIDIA TITAN Xp GPU... on the same TPUv3 training platform. |
| Software Dependencies | No | The paper mentions software like PyTorch and TensorFlow, and references their original papers, but does not specify exact version numbers for these or other libraries used in the experiments. For instance, it states "requiring a no_grad operation in PyTorch (Paszke et al., 2019) or stop_gradient operation in TensorFlow (Abadi et al., 2016)" but gives no version numbers; a sketch of that no_grad usage appears below the table. |
| Experiment Setup | Yes | We apply a poly learning-rate policy under batch size 12 and 30k iterations (about 35 epochs) for fast experiments (less than 12 hours using 1 NVIDIA TITAN Xp GPU). The initial learning rate is set to 0.009, multiplied by (1 − iter/iter_max)^0.9 after each iteration, with momentum 0.9 and weight decay 0.0001. Latent dimension d and r... are set to 512 and 64. The iterations of MD's optimization algorithm, K, are set to 6. See the poly learning-rate sketch below the table. |