Zero-Shot Knowledge Distillation from a Decision-Based Black-Box Model

Authors: Zi Wang

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "4. Experiments: In this section, we first demonstrate the performance of DB3KD when training samples are accessible. Then we show the results of ZSDB3KD under the circumstance that training data is not accessible."
Researcher Affiliation | Academia | "Department of Electrical Engineering and Computer Science, The University of Tennessee, Knoxville, TN, USA. Correspondence to: Zi Wang <zwang84@vols.utk.edu>."
Pseudocode | No | The paper includes "Figure 3. The iterative procedure for the optimization of MBD", which is a flowchart, but it does not contain structured pseudocode or algorithm blocks (e.g., environments labeled "Algorithm" or "Pseudocode"). A hedged sketch of such an iterative boundary-distance search is given after this table.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology, nor a link to a code repository.
Open Datasets | Yes | "A LeNet-5 (LeCun et al., 1998) with two convolutional layers is pre-trained on MNIST (LeCun et al., 1998) as the teacher... (2) The same teacher and student networks as in (1) are used but are trained and evaluated on the Fashion-MNIST dataset. (3) An AlexNet (Krizhevsky et al., 2012) pre-trained on CIFAR-10 (Krizhevsky et al., 2009) is used as the teacher. (4) A ResNet-34 (He et al., 2016) pre-trained on the high-resolution, fine-grained dataset FLOWERS102 (Nilsback & Zisserman, 2008) is used as the teacher, and the student is a ResNet-18." A dataset-loading sketch for these settings follows the table.
Dataset Splits | No | The paper states training parameters such as epochs, learning rate, and optimizer, and mentions running with random seeds, but it does not explicitly specify the validation splits (e.g., percentages or sample counts) needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using an "Adam optimizer" and implies the use of a deep learning framework (likely PyTorch, given the related work and common practice), but it does not provide version numbers for any software dependencies.
Experiment Setup | Yes | "We train the student networks for 100 epochs, using an Adam optimizer (learning rate 5e-3), for all the datasets except for FLOWERS102, which is trained for 200 epochs. The scaling factor λ is set to 1 for simplicity. With a hyperparameter search, we find that smaller τs between 0.2 and 1.0 lead to good performance. We use τ = 0.3 in our experiments. All experiments are evaluated for 5 runs with random seeds. ... For DB3KD-SD, we use 100 samples from each class to compute the sample robustness... ϵ is set to 1e-5... In DB3KD-MBD, we use 200 Gaussian random vectors to estimate the gradient and try different numbers of queries from 1000 to 20000 with ξd = 0.2 to optimize the MBD... The sample robustness is calculated in parallel with a batch size of 20 for FLOWERS102, and 200 for the other datasets. ... We optimize the pseudo samples for 40 (ξo = 0.5) and 100 iterations (ξo = 3.0) for the two LeNet-5 and the AlexNet experiments, respectively. The query is limited to 5000 when iteratively searching for the MBD. We generate 8000 samples for each class with a batch size of 200 for all the experiments." A training-loop sketch based on these settings also follows the table.
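
For reference, below is a minimal sketch of how the four dataset settings listed under "Open Datasets" could be assembled, assuming a PyTorch/torchvision stack (the paper does not name its framework) and default preprocessing. The root path "data", the use of torchvision's built-in Flowers102 loader (torchvision >= 0.12), and the absence of the paper's pretrained teacher weights are assumptions; LeNet-5 is not shipped with torchvision and would need a custom definition.

from torchvision import datasets, models, transforms

to_tensor = transforms.ToTensor()

# Training splits for the four settings quoted in the "Open Datasets" row.
train_sets = {
    "MNIST": datasets.MNIST("data", train=True, download=True, transform=to_tensor),
    "Fashion-MNIST": datasets.FashionMNIST("data", train=True, download=True, transform=to_tensor),
    "CIFAR-10": datasets.CIFAR10("data", train=True, download=True, transform=to_tensor),
    "FLOWERS102": datasets.Flowers102("data", split="train", download=True, transform=to_tensor),
}

# Teacher/student architectures quoted for the FLOWERS102 setting; the paper's
# pretrained teacher weights are not available here, so only the shapes are fixed.
teacher = models.resnet34(num_classes=102)
student = models.resnet18(num_classes=102)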
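
Because the paper describes the MBD (minimal boundary distance) optimization only as a flowchart (Figure 3), the following is a hedged sketch of one plausible ingredient of such an iterative search: estimating a boundary-normal direction from decision-only teacher queries with Gaussian random vectors, matching the quoted setup (200 vectors, perturbation scale ξd). The callable teacher_label, the sign convention, and the update rule are assumptions, not the authors' algorithm.

import torch

def estimate_boundary_normal(teacher_label, x_boundary, source_label,
                             n_vectors=200, xi_d=0.2):
    """Decision-only Monte-Carlo estimate of the boundary normal at a point
    x_boundary lying near the teacher's decision boundary. teacher_label(x)
    returns the black-box top-1 decision; source_label is the label of the
    original sample whose boundary distance is being measured."""
    probes = torch.randn(n_vectors, *x_boundary.shape)
    probes = probes / probes.flatten(1).norm(dim=1).view(-1, *([1] * x_boundary.dim()))
    # +1 if the perturbed point crosses to another class, -1 if it stays put.
    signs = torch.tensor([
        1.0 if teacher_label(x_boundary + xi_d * p) != source_label else -1.0
        for p in probes
    ])
    normal = (signs.view(-1, *([1] * x_boundary.dim())) * probes).mean(dim=0)
    return normal / (normal.norm() + 1e-12)

# In a full MBD search, such estimates would be alternated with steps that pull
# the boundary point back toward the original sample, keeping the smallest
# distance found within the query budget (limited to 5000 in the paper).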
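
The experiment-setup quote is concrete enough to sketch a student training loop. The snippet below assumes PyTorch, a standard temperature-scaled KL distillation objective, and that DB3KD-style soft labels have already been constructed from sample robustness or MBD (they enter as precomputed, unnormalized class scores); the loader contract and loss form are illustrative, not the authors' exact implementation.

import torch
import torch.nn.functional as F

TAU = 0.3        # temperature reported in the paper
LAMBDA = 1.0     # scaling factor λ reported in the paper
LR = 5e-3        # Adam learning rate reported in the paper
EPOCHS = 100     # 200 for FLOWERS102

def train_student(student, loader, epochs=EPOCHS):
    """loader is assumed to yield (images, soft_scores) pairs, where soft_scores
    are DB3KD soft labels built from decision-only teacher queries."""
    optimizer = torch.optim.Adam(student.parameters(), lr=LR)
    for _ in range(epochs):
        for images, soft_scores in loader:
            logits = student(images)
            # Temperature-softened KL between student predictions and the soft
            # labels, scaled by LAMBDA; the paper's exact objective is not quoted.
            loss = LAMBDA * F.kl_div(
                F.log_softmax(logits / TAU, dim=1),
                F.softmax(soft_scores / TAU, dim=1),
                reduction="batchmean",
            ) * (TAU ** 2)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student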