SequentialAttention++ for Block Sparsification: Differentiable Pruning Meets Combinatorial Optimization
Authors: Taisuke Yasuda, Kyriakos Axiotis, Gang Fu, MohammadHossein Bateni, Vahab Mirrokni
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The resulting algorithm that we propose, SequentialAttention++, advances the state of the art in large-scale neural network block-wise pruning tasks on the ImageNet and Criteo datasets. |
| Researcher Affiliation | Industry | Taisuke Yasuda* (Voleon Group) yasuda.taisuke1@gmail.com; Kyriakos Axiotis (Google Research) axiotis@google.com; Gang Fu (Google Research) thomasfu@google.com; MohammadHossein Bateni (Google Research) bateni@google.com; Vahab Mirrokni (Google Research) mirrokni@google.com |
| Pseudocode | Yes | Algorithm 1 Feed-forward layer with the basic version of Sequential Attention++ to select top 𝑘 parameters from a kernel W. [...] Algorithm 2 Attention mask. We omit SPARSIFICATION phases for simplicity. |
| Open Source Code | No | We plan to release the code used in experiments if accepted. |
| Open Datasets | Yes | We evaluate our algorithms on sparsification tasks where a dense DNN is approximated by block-sparse counterparts, at various block sizes 𝐵 and sparsities 𝑝, where a sparsity 𝑝 indicates that the DNN layer will only have a 1 − 𝑝 fraction of nonzero entries, and a block size of 𝐵 indicates that the nonzero entries are arranged in 𝐵 × 𝐵 blocks. Note that for a fixed sparsity, larger block sizes generally translate to improved efficiency due to improved hardware utilization, but also degrade quality. A block size of 1 corresponds to unstructured pruning. Our experiments are performed on the ImageNet and Criteo datasets. |
| Dataset Splits | Yes | Our results on ImageNet are summarized in Table 1. The sparsities range over 58-95% and the block sizes over 8, 16, 32, 64. We compare ACDC and SequentialAttention++. Our ACDC implementation closely follows the implementation in Peste et al. [2021]. We use the phase schedule suggested by Kuznedelev et al. [2023b] (10% dense, 7 equal SPARSE-DENSE phases where the last dense phase is extended by 5%, 15% sparse). For SequentialAttention++, we additionally replace each sparse-dense [...] We use ResNet50 and a standard training setup (90 epochs, SGD with cosine learning rate and momentum, weight decay). [...] Our dense baseline validation accuracy is 76.90. The dashes are results where the algorithms diverged because of extreme sparsity. The sparsities were chosen as 70%, 80%, 90%, 95%. |
| Hardware Specification | No | The paper does not explicitly state the specific hardware used (e.g., GPU model, CPU type, memory). The NeurIPS checklist indicates this information would be released with the code if accepted, implying it's not in the paper itself. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers. The NeurIPS checklist indicates this information would be released with the code if accepted, implying it's not in the paper itself. |
| Experiment Setup | Yes | We use ResNet50 and a standard training setup (90 epochs, SGD with cosine learning rate and momentum, weight decay). [...] We use the phase schedule suggested by Kuznedelev et al. [2023b] (10% dense, 7 equal SPARSE-DENSE phases where the last dense phase is extended by 5%, 15% sparse). [...] We use a batch size of 2048 and a maximum learning rate of 0.8. [...] We use the Adam optimizer with a learning rate that decays exponentially from 2 × 10⁻² to 3 × 10⁻⁴. We train to minimize the cross-entropy loss for 25 epochs with a batch size of 32768. |
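The sketches below expand on the Pseudocode, Open Datasets, Dataset Splits, and Experiment Setup rows of the table.

The Pseudocode row quotes Algorithms 1 and 2, which wrap a feed-forward layer's kernel in a trainable softmax attention mask used to select the top-𝑘 parameters. As a rough illustration of that idea only, here is a minimal NumPy sketch; the per-block logits, the helper names (`attention_masked_forward`, `top_k_block_mask`), and the block handling are assumptions on our part, not the authors' released code (the quoted Algorithm 1 describes the basic per-parameter version).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a flat vector."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def attention_masked_forward(x, W, logits, block=8):
    """Forward pass of a dense layer whose kernel is modulated by a softmax
    attention mask over B x B blocks. Hypothetical sketch of the idea behind
    the quoted Algorithms 1 and 2, not the paper's code.

    x:      (batch, d_in) input activations
    W:      (d_in, d_out) kernel
    logits: trainable importance scores, one per B x B block
    """
    d_in, d_out = W.shape
    rows, cols = d_in // block, d_out // block
    scores = softmax(logits.ravel()).reshape(rows, cols)
    # Broadcast each block's attention score over its B x B entries.
    mask = np.kron(scores, np.ones((block, block)))
    return x @ (W * mask)

def top_k_block_mask(logits, k, block=8):
    """Hard 0/1 mask keeping the k highest-scoring blocks."""
    flat = logits.ravel()
    keep = np.zeros_like(flat)
    keep[np.argsort(flat)[-k:]] = 1.0
    return np.kron(keep.reshape(logits.shape), np.ones((block, block)))

# Toy usage: a 16 x 24 kernel with 8 x 8 blocks, keeping the top 2 of 6 blocks.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
W = rng.normal(size=(16, 24))
logits = rng.normal(size=(16 // 8, 24 // 8))
y = attention_masked_forward(x, W, logits, block=8)
hard = top_k_block_mask(logits, k=2, block=8)
```

During a selection phase the logits receive gradients through the masked forward pass; a hard mask like `top_k_block_mask` would then fix the sparse support for subsequent phases.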
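The sparsity/block-size bookkeeping quoted in the Open Datasets row is easy to make concrete. In the arithmetic sketch below, the layer shape and the rounding rule are illustrative assumptions, not values from the paper.

```python
def blocks_kept(d_in, d_out, sparsity, block):
    """Number of B x B blocks kept at a target sparsity.

    sparsity = 0.9 means only (1 - 0.9) = 10% of entries stay nonzero.
    Illustrative arithmetic only; actual per-layer budgets may differ.
    """
    total_blocks = (d_in // block) * (d_out // block)
    return max(1, int(round((1.0 - sparsity) * total_blocks)))

# e.g. a 2048 x 2048 layer at 90% sparsity with 32 x 32 blocks
# keeps about 410 of its 4096 blocks.
print(blocks_kept(2048, 2048, sparsity=0.90, block=32))
```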
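The ACDC phase schedule quoted in the Dataset Splits row ("10% dense, 7 equal SPARSE-DENSE phases where the last dense phase is extended by 5%, 15% sparse") leaves the split inside each sparse-dense pair unspecified. One possible reading, flagged as an assumption in the code, is:

```python
def acdc_phase_schedule(total_epochs=90):
    """One *possible* reading of the quoted schedule: 10% dense warmup,
    7 equal sparse-dense pairs (last dense stretch extended by 5%),
    and a final 15% sparse phase. The even split inside each pair is
    an assumption, not taken from the paper.
    """
    phases = [("dense", 0.10)]
    pair = (1.0 - 0.10 - 0.05 - 0.15) / 7   # 10% of training per sparse-dense pair
    for i in range(7):
        phases.append(("sparse", pair / 2))
        extra = 0.05 if i == 6 else 0.0      # last dense phase extended by 5%
        phases.append(("dense", pair / 2 + extra))
    phases.append(("sparse", 0.15))
    return [(name, round(frac * total_epochs, 1)) for name, frac in phases]

for name, epochs in acdc_phase_schedule():
    print(f"{name:6s} {epochs:5.1f} epochs")
```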
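The Experiment Setup row quotes two learning-rate schedules: a cosine schedule with a 0.8 peak for the ImageNet/ResNet50 run, and an exponential decay from 2 × 10⁻² to 3 × 10⁻⁴ over 25 epochs for the Criteo run. A hedged sketch of both follows; warmup behaviour and per-epoch decay granularity are assumptions, since the excerpt does not specify them.

```python
import math

def cosine_lr(epoch, total_epochs=90, lr_max=0.8):
    """Cosine learning-rate schedule peaking at lr_max (ImageNet run;
    any warmup is omitted because the excerpt does not describe one)."""
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * epoch / total_epochs))

def exp_decay_lr(epoch, total_epochs=25, lr_start=2e-2, lr_end=3e-4):
    """Exponential decay from lr_start to lr_end over total_epochs
    (Criteo run; per-epoch granularity is an assumption)."""
    rate = (lr_end / lr_start) ** (1.0 / max(1, total_epochs - 1))
    return lr_start * rate ** epoch

print(round(cosine_lr(0), 3), round(cosine_lr(45), 3), round(cosine_lr(90), 3))
print(round(exp_decay_lr(0), 5), round(exp_decay_lr(24), 5))
```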