PDP: Parameter-free Differentiable Pruning is All You Need

Authors: Minsik Cho, Saurabh Adya, Devang Naik

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We compared our PDP with state-of-the-art random, structured, and channel pruning schemes on various computer vision and natural language models." "We used two x86 Linux nodes with eight NVIDIA V100 GPUs on each in a cloud environment."
Researcher Affiliation | Industry | Minsik Cho, Saurabh Adya, Devang Naik ({minsik, sadya, naik.d}@apple.com)
Pseudocode | Yes | "Algorithm 1 Training flow for PDP" (see the illustrative sketch after this table)
Open Source Code | No | The paper provides code references for other methods in Section G, but it does not state that the source code for PDP itself is publicly available, nor does it provide a link to it.
Open Datasets | Yes | "We compared the proposed PDP with the latest prior arts... on ResNet18, ResNet50, MobileNet-v1, and MobileNet-v2 [20, 22, 43] with the ImageNet1k dataset [9]." and "We compared PDP with the state-of-the-art pruning results from MVP [44] and POFA [56] (quoted from the respective papers) in addition to OptG and STR (reproduced in our environment) on a BERT model [10] for the two largest NLP tasks of the GLUE benchmark, MNLI (Multi-Genre Natural Language Inference) with 392,702 samples and QQP (Quora Question Pairs) with 363,836 samples."
Dataset Splits | Yes | "For PDP, we had the following variants to show the value of PDP with the same training overhead or per-layer pruning budgets. ... We use the weights of 0.95 on the distillation loss and 0.05 on the task loss for PDP, and 0.75 on the distillation loss and 0.25 on the task loss for STR. ... After r epochs, the validation accuracy hits consistently over the half of the accuracy upper-bound (which is 50% for classifications) for 5 epochs."
Hardware Specification | Yes | "We used two x86 Linux nodes with eight NVIDIA V100 GPUs on each in a cloud environment."
Software Dependencies | No | The paper mentions optimizers such as SGD and AdamW and learning-rate schedulers, but does not provide version numbers for software libraries (e.g., PyTorch, TensorFlow) or programming languages beyond generic references.
Experiment Setup | Yes | "Table 8: The hyper-parameters in Sections 3 and 4." (This table provides the batch size, number of epochs, optimizer, learning rate, and other parameters used for the various models and methods.)
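
The pseudocode row above points to Algorithm 1, the PDP training flow. For orientation only, below is a minimal sketch of generic differentiable magnitude-based soft-mask pruning in PyTorch. It is not the paper's Algorithm 1: the sigmoid mask with a temperature, the linear sparsity ramp, the toy single-layer model, and all hyper-parameter values are illustrative assumptions.

```python
# Illustrative sketch only: generic differentiable magnitude-based soft-mask
# pruning on a toy linear layer. The mask formula, sparsity ramp, and all
# hyper-parameters are assumptions, not the paper's Algorithm 1.
import torch
import torch.nn.functional as F

def soft_prune_mask(weight, sparsity, temperature=1e-2):
    """Differentiable approximation of the hard mask 1[w^2 > threshold]."""
    if sparsity <= 0.0:
        return torch.ones_like(weight)
    w2 = weight.pow(2)
    # Threshold chosen so that a `sparsity` fraction of weights falls below it.
    threshold = torch.quantile(w2.detach().flatten(), sparsity)
    return torch.sigmoid((w2 - threshold) / temperature)

# Toy single-layer "model": a weight matrix and bias trained with SGD.
weight = torch.randn(10, 20, requires_grad=True)
bias = torch.zeros(10, requires_grad=True)
optimizer = torch.optim.SGD([weight, bias], lr=0.1)

target_sparsity, warmup_steps = 0.9, 100
for step in range(200):
    x = torch.randn(32, 20)                    # random stand-in batch
    y = torch.randint(0, 10, (32,))
    # Ramp the sparsity target up gradually (a common schedule choice).
    sparsity = target_sparsity * min(1.0, step / warmup_steps)
    mask = soft_prune_mask(weight, sparsity)
    logits = F.linear(x, weight * mask, bias)  # masked weights in the forward pass
    loss = F.cross_entropy(logits, y)
    optimizer.zero_grad()
    loss.backward()                            # gradients flow through the soft mask
    optimizer.step()

# Finalize: commit to a hard mask and zero out the pruned weights.
final_mask = (soft_prune_mask(weight, target_sparsity) > 0.5).float()
pruned_weight = weight.detach() * final_mask
print(f"achieved sparsity: {(pruned_weight == 0).float().mean().item():.2f}")
```

The actual PDP training flow, mask computation, and settings should be taken from Algorithm 1 and the hyper-parameters in Table 8 of the paper.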