Differentiable Model Scaling using Differentiable Topk
Authors: Kai Liu, Ruohui Wang, Jianfei Gao, Kai Chen
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have evaluated our DMS across diverse tasks, ranging from vision tasks to NLP tasks and various network architectures, including CNNs and Transformers. Results consistently indicate that our DMS can find improved structures and outperforms state-of-the-art NAS methods. Specifically, for image classification on ImageNet, our DMS improves the top-1 accuracy of EfficientNet-B0 and DeiT-Tiny by 1.4% and 0.6%, respectively, and outperforms the state-of-the-art zero-shot NAS method, ZiCo, by 1.3% while requiring only 0.4 GPU days for searching. For object detection on COCO, DMS improves the mAP of Yolo-v8-n by 2.0%. For language modeling, our pruned Llama-7B outperforms the prior method with lower perplexity and higher zero-shot classification accuracy. |
| Researcher Affiliation | Academia | 1Shanghai AI Laboratory, Shanghai, China. Correspondence to: Kai Liu <liukai@pjlab.org.cn>, Kai Chen <chenkai@pjlab.org.cn>. |
| Pseudocode | No | The paper describes the proposed method in prose and through diagrams (e.g., Figure 2 for the forward and backward graph of Differentiable Topk) but does not include any explicit pseudocode blocks or sections labeled 'Algorithm'. (A minimal illustrative soft top-k sketch is given after the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/LKJacky/Differentiable-Model-Scaling. |
| Open Datasets | Yes | Specifically, for image classification on ImageNet... For object detection on COCO... For language modeling, our pruned Llama-7B outperforms the prior method with lower perplexity and higher zero-shot classification accuracy. Specifically, we observe reduced perplexity on WikiText2 (Merity et al., 2016) and PTB (Marcus et al., 1993), and higher zero-shot classification accuracy on BoolQ (Clark et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-e (Clark et al., 2018), and ARC-c (Clark et al., 2018). |
| Dataset Splits | No | The paper discusses training, searching, and retraining phases but does not explicitly provide details about specific training/validation/test dataset splits (e.g., percentages or sample counts) for any of the datasets used (ImageNet, COCO, WikiText2, etc.). It refers to standard datasets and settings but lacks explicit reproducibility information for data partitioning. |
| Hardware Specification | Yes | it only takes hundreds of iterations, costing less than 10 minutes on a single RTX3090, to search for a model. |
| Software Dependencies | No | The paper mentions using the 'Timm library (Wightman, 2019)', 'MMPretrain (Contributors, 2023)', and 'MMYolo (Contributors, 2022)'. While these libraries are cited, specific version numbers for them, or for fundamental software such as Python, PyTorch, or CUDA, are not explicitly provided. (A version-logging sketch for reproduction attempts is given after the table.) |
| Experiment Setup | Yes | In general, given a baseline model and a training setting, we enlarge the baseline model as our supernet and decrease the number of epochs of the training setting as our searching setting. We list details of our experiment setting as shown below. EfficientNet: For all DMSnp-EN variants, we pruned the supernets over a span of 30 epochs. For those DMSnp-EN variants with MACs fewer than 0.5G, the pruning was conducted from EfficientNet-B4, using an input size of 224. Meanwhile, for DMSnp-EN-B1 and B2, the pruning was initiated from EfficientNet-B7. The input sizes for DMSnp-EN-B1 and B2 were 256 and 288, respectively. Subsequently, the DMSnp-EN variants were retrained using the corresponding training scripts of EfficientNet available in the Timm library (Wightman, 2019). (A sketch of the supernet setup is given after the table.) |
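
Since the paper provides no pseudocode for its Differentiable Topk operator (see the Pseudocode row), the following is a minimal sketch of a generic sigmoid-relaxed soft top-k mask. It only illustrates the general idea of making the number of kept elements differentiable; the function name `soft_topk_mask`, the temperature `tau`, and the rank-based normalization are assumptions of this sketch, not the paper's exact formulation.

```python
# Hedged sketch: a generic sigmoid-relaxed soft top-k mask (NOT the paper's
# exact operator). Names and the temperature value are assumptions.
import torch


def soft_topk_mask(importance: torch.Tensor,
                   keep_ratio: torch.Tensor,
                   tau: float = 0.05) -> torch.Tensor:
    """Soft mask in (0, 1) that keeps roughly `keep_ratio` of the elements."""
    n = importance.numel()
    # Rank-normalize importance into (0, 1]: the least important element
    # maps to 1/n, the most important to 1.
    ranks = importance.argsort().argsort().float()
    c = (ranks + 1.0) / n
    # Elements whose normalized rank exceeds the pruning ratio (1 - keep_ratio)
    # receive mask values near 1. The sigmoid keeps the mask differentiable
    # w.r.t. keep_ratio, so the kept width can be learned by gradient descent.
    # (The hard rank above is not differentiable w.r.t. `importance` itself.)
    return torch.sigmoid((c - (1.0 - keep_ratio)) / tau)


# Usage: mask 64 channels with a learnable keep ratio.
scores = torch.randn(64)                      # per-channel importance scores
keep = torch.nn.Parameter(torch.tensor(0.5))  # learnable fraction to keep
mask = soft_topk_mask(scores, keep)           # values in (0, 1), shape (64,)
```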
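
Because no library versions are reported (see the Software Dependencies row), a reproduction attempt may want to log its own environment. The snippet below is a minimal sketch; the PyPI package names `timm`, `mmpretrain`, and `mmyolo` are assumed to correspond to the cited libraries, and the versions it prints are those of the local machine, not the authors'.

```python
# Hedged sketch: record the local environment when attempting a reproduction.
import platform
from importlib.metadata import PackageNotFoundError, version

import torch

print("python :", platform.python_version())
print("torch  :", torch.__version__, "| cuda:", torch.version.cuda)
for pkg in ("timm", "mmpretrain", "mmyolo"):
    try:
        print(f"{pkg:<11}:", version(pkg))
    except PackageNotFoundError:
        print(f"{pkg:<11}: not installed")
```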
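
As a rough illustration of the Experiment Setup row, the sketch below instantiates the EfficientNet supernets named in the quoted setup using the Timm library. The Timm model names `efficientnet_b4`/`efficientnet_b7` and the `input_sizes` dictionary are assumptions for illustration; the actual DMS search wraps these backbones with its differentiable width/depth masks, which is not reproduced here.

```python
# Hedged sketch (not the authors' code): instantiate the EfficientNet
# supernets named in the quoted setup via Timm.
import timm

# Sub-0.5G-MACs variants are pruned from EfficientNet-B4 at 224x224 input;
# DMSnp-EN-B1 and B2 start from EfficientNet-B7 at 256 and 288, respectively.
supernet_b4 = timm.create_model("efficientnet_b4", pretrained=False, num_classes=1000)
supernet_b7 = timm.create_model("efficientnet_b7", pretrained=False, num_classes=1000)

# Assumed bookkeeping of input resolutions per variant group.
input_sizes = {"sub_0.5G_MACs": 224, "DMSnp-EN-B1": 256, "DMSnp-EN-B2": 288}
```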