SequentialAttention++ for Block Sparsification: Differentiable Pruning Meets Combinatorial Optimization
Authors: Taisuke Yasuda, Kyriakos Axiotis, Gang Fu, MohammadHossein Bateni, Vahab Mirrokni
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The resulting algorithm that we propose, SequentialAttention++, advances the state of the art in large-scale neural network block-wise pruning tasks on the ImageNet and Criteo datasets. |
| Researcher Affiliation | Industry | Taisuke Yasuda* (Voleon Group) yasuda.taisuke1@gmail.com; Kyriakos Axiotis (Google Research) axiotis@google.com; Gang Fu (Google Research) thomasfu@google.com; MohammadHossein Bateni (Google Research) bateni@google.com; Vahab Mirrokni (Google Research) mirrokni@google.com |
| Pseudocode | Yes | Algorithm 1 Feed-forward layer with the basic version of Sequential Attention++ to select top 𝑘 parameters from a kernel W. [...] Algorithm 2 Attention mask. We omit SPARSIFICATION phases for simplicity. |
| Open Source Code | No | We plan to release the code used in experiments if accepted. |
| Open Datasets | Yes | We evaluate our algorithms on sparsification tasks where a dense DNN is approximated by block-sparse counterparts, at various block sizes 𝐵 and sparsities 𝑝, where a sparsity 𝑝 indicates that the DNN layer will only have a 1 − 𝑝 fraction of nonzero entries, and a block size of 𝐵 indicates that the nonzero entries are arranged in 𝐵 × 𝐵 blocks. Note that for a fixed sparsity, larger block sizes generally translate to improved efficiency due to improved hardware utilization, but also degrade quality. A block size of 1 corresponds to unstructured pruning. Our experiments are performed on the ImageNet and Criteo datasets. |
| Dataset Splits | Yes | Our results on ImageNet are summarized in Table 1. The sparsities range over 58-95% and the block sizes over 8, 16, 32, 64. We compare ACDC and SequentialAttention++. Our ACDC implementation closely follows the implementation in Peste et al. [2021]. We use the phase schedule suggested by Kuznedelev et al. [2023b] (10% dense, 7 equal SPARSE-DENSE phases where the last dense phase is extended by 5%, 15% sparse). For SequentialAttention++, we additionally replace each sparse-dense [...] We use ResNet50 and a standard training setup (90 epochs, SGD with cosine learning rate and momentum, weight decay). [...] Our dense baseline validation accuracy is 76.90. The dashes are results where the algorithms diverged because of extreme sparsity. The sparsities were chosen as 70%, 80%, 90%, 95%. |
| Hardware Specification | No | The paper does not explicitly state the specific hardware used (e.g., GPU model, CPU type, memory). The NeurIPS checklist indicates this information would be released with the code if accepted, implying it's not in the paper itself. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers. The NeurIPS checklist indicates this information would be released with the code if accepted, implying it's not in the paper itself. |
| Experiment Setup | Yes | We use ResNet50 and a standard training setup (90 epochs, SGD with cosine learning rate and momentum, weight decay). [...] We use the phase schedule suggested by Kuznedelev et al. [2023b] (10% dense, 7 equal SPARSE-DENSE phases where the last dense phase is extended by 5%, 15% sparse). [...] We use a batch size of 2048 and a maximum learning rate of 0.8. [...] We use the Adam optimizer with a learning rate that decays exponentially from 2 × 10⁻² to 3 × 10⁻⁴. We train to minimize the cross-entropy loss for 25 epochs with a batch size of 32768. |
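The sketches below expand on the Pseudocode, Open Datasets, Dataset Splits, and Experiment Setup rows of the table.

The Pseudocode row quotes Algorithms 1 and 2, which wrap a feed-forward layer's kernel in a trainable softmax attention mask used to select the top-𝑘 parameters. As a rough illustration of that idea only, here is a minimal NumPy sketch; the per-block logits, the helper names (`attention_masked_forward`, `top_k_block_mask`), and the block handling are assumptions on our part, not the authors' released code (the quoted Algorithm 1 describes the basic per-parameter version).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a flat vector."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def attention_masked_forward(x, W, logits, block=8):
    """Forward pass of a dense layer whose kernel is modulated by a softmax
    attention mask over B x B blocks. Hypothetical sketch of the idea behind
    the quoted Algorithms 1 and 2, not the paper's code.

    x:      (batch, d_in) input activations
    W:      (d_in, d_out) kernel
    logits: trainable importance scores, one per B x B block
    """
    d_in, d_out = W.shape
    rows, cols = d_in // block, d_out // block
    scores = softmax(logits.ravel()).reshape(rows, cols)
    # Broadcast each block's attention score over its B x B entries.
    mask = np.kron(scores, np.ones((block, block)))
    return x @ (W * mask)

def top_k_block_mask(logits, k, block=8):
    """Hard 0/1 mask keeping the k highest-scoring blocks."""
    flat = logits.ravel()
    keep = np.zeros_like(flat)
    keep[np.argsort(flat)[-k:]] = 1.0
    return np.kron(keep.reshape(logits.shape), np.ones((block, block)))

# Toy usage: a 16 x 24 kernel with 8 x 8 blocks, keeping the top 2 of 6 blocks.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
W = rng.normal(size=(16, 24))
logits = rng.normal(size=(16 // 8, 24 // 8))
y = attention_masked_forward(x, W, logits, block=8)
hard = top_k_block_mask(logits, k=2, block=8)
```

During a selection phase the logits receive gradients through the masked forward pass; a hard mask like `top_k_block_mask` would then fix the sparse support for subsequent phases.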
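The sparsity/block-size bookkeeping quoted in the Open Datasets row is easy to make concrete. In the arithmetic sketch below, the layer shape and the rounding rule are illustrative assumptions, not values from the paper.

```python
def blocks_kept(d_in, d_out, sparsity, block):
    """Number of B x B blocks kept at a target sparsity.

    sparsity = 0.9 means only (1 - 0.9) = 10% of entries stay nonzero.
    Illustrative arithmetic only; actual per-layer budgets may differ.
    """
    total_blocks = (d_in // block) * (d_out // block)
    return max(1, int(round((1.0 - sparsity) * total_blocks)))

# e.g. a 2048 x 2048 layer at 90% sparsity with 32 x 32 blocks
# keeps about 410 of its 4096 blocks.
print(blocks_kept(2048, 2048, sparsity=0.90, block=32))
```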
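The ACDC phase schedule quoted in the Dataset Splits row ("10% dense, 7 equal SPARSE-DENSE phases where the last dense phase is extended by 5%, 15% sparse") leaves the split inside each sparse-dense pair unspecified. One possible reading, flagged as an assumption in the code, is:

```python
def acdc_phase_schedule(total_epochs=90):
    """One *possible* reading of the quoted schedule: 10% dense warmup,
    7 equal sparse-dense pairs (last dense stretch extended by 5%),
    and a final 15% sparse phase. The even split inside each pair is
    an assumption, not taken from the paper.
    """
    phases = [("dense", 0.10)]
    pair = (1.0 - 0.10 - 0.05 - 0.15) / 7   # 10% of training per sparse-dense pair
    for i in range(7):
        phases.append(("sparse", pair / 2))
        extra = 0.05 if i == 6 else 0.0      # last dense phase extended by 5%
        phases.append(("dense", pair / 2 + extra))
    phases.append(("sparse", 0.15))
    return [(name, round(frac * total_epochs, 1)) for name, frac in phases]

for name, epochs in acdc_phase_schedule():
    print(f"{name:6s} {epochs:5.1f} epochs")
```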
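The Experiment Setup row quotes two learning-rate schedules: a cosine schedule with a 0.8 peak for the ImageNet/ResNet50 run, and an exponential decay from 2 × 10⁻² to 3 × 10⁻⁴ over 25 epochs for the Criteo run. A hedged sketch of both follows; warmup behaviour and per-epoch decay granularity are assumptions, since the excerpt does not specify them.

```python
import math

def cosine_lr(epoch, total_epochs=90, lr_max=0.8):
    """Cosine learning-rate schedule peaking at lr_max (ImageNet run;
    any warmup is omitted because the excerpt does not describe one)."""
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * epoch / total_epochs))

def exp_decay_lr(epoch, total_epochs=25, lr_start=2e-2, lr_end=3e-4):
    """Exponential decay from lr_start to lr_end over total_epochs
    (Criteo run; per-epoch granularity is an assumption)."""
    rate = (lr_end / lr_start) ** (1.0 / max(1, total_epochs - 1))
    return lr_start * rate ** epoch

print(round(cosine_lr(0), 3), round(cosine_lr(45), 3), round(cosine_lr(90), 3))
print(round(exp_decay_lr(0), 5), round(exp_decay_lr(24), 5))
```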