RareGAN: Generating Samples for Rare Classes

Authors: Zinan Lin, Hao Liang, Giulia Fanti, Vyas Sekar

AAAI 2022, pp. 7506-7515

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that RareGAN achieves a better fidelity-diversity tradeoff on the rare class than prior work across different applications, budgets, rare class fractions, GAN losses, and architectures. Table 1 shows that RareGAN achieves better fidelity and diversity (with a smaller labeling budget) when generating DNS amplification attack packets, compared to a state-of-the-art domain-specific technique (Moon et al. 2021). Although RareGAN is primarily motivated by applications in security, networking, and systems, we also consider image generation, both as a useful tool in its own right and to visualize the improvements. Fig. 1 shows generated samples trained on a modified MNIST handwritten digit dataset (LeCun et al. 1998) where we artificially force the 0 digit to be the rare class (1% of the training data). ACGAN (Odena, Olah, and Shlens 2017), ALCG (Xie and Huang 2019), and BAGAN (Mariani et al. 2018) produce severely mode-collapsed samples. Elastic-InfoGAN (Ojha et al. 2019) produces samples from the wrong class. A standard GAN memorizes the training dataset. RareGAN (bottom) produces high-quality, diverse samples from the correct class without memorizing the training data. We conduct experiments on all three applications in Section 1. The code can be found at https://github.com/fjxmlzn/RareGAN. Unless otherwise specified, the default configurations are: the number of stages S = 2 (for RareGAN and ALCG), weight w = 3 (for RareGAN); in DNS, labeling budget B = 200,000, rare class fraction α = 0.776% (corresponding to T = 10); in packet classification, B = 200,000, α = 1.150% (corresponding to T = 0.055); in MNIST, B = 5,000, α = 1%; in CIFAR10, B = 10,000, α = 10%. All experiments are run over 5 random seeds. (These defaults are collected in a sketch after this table.)
Researcher Affiliation | Academia | Zinan Lin, Hao Liang, Giulia Fanti, Vyas Sekar (Carnegie Mellon University)
Pseudocode | No | The paper describes its methods in prose and equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code can be found at https://github.com/fjxmlzn/RareGAN.
Open Datasets | Yes | Fig. 1 shows generated samples trained on a modified MNIST handwritten digit dataset (LeCun et al. 1998) where we artificially force the 0 digit to be the rare class (1% of the training data). Following the settings of related work (Mariani et al. 2018), we simulate the imbalanced dataset with the widely used MNIST (LeCun et al. 1998) and CIFAR10 (Krizhevsky 2009) datasets. (A sketch of this imbalance simulation appears after the table.)
Dataset Splits | No | The paper mentions training and testing and uses well-known datasets, but it does not explicitly state the proportions or counts for training, validation, and test splits. It references the original train/test splits of MNIST and CIFAR10 in the appendix but does not specify a validation split.
Hardware Specification | No | The paper mentions computing platforms (CloudLab and the Bridges system at the Pittsburgh Supercomputing Center, part of XSEDE) but does not specify exact GPU/CPU models, processor types, or memory details.
Software Dependencies | No | The paper mentions specific GAN variants (ACGAN, Wasserstein GAN) and the Jensen-Shannon divergence loss, and refers to a TensorFlow implementation, but it does not provide version numbers for TensorFlow or any other software libraries or dependencies.
Experiment Setup | Yes | Unless otherwise specified, the default configurations are: the number of stages S = 2 (for RareGAN and ALCG), weight w = 3 (for RareGAN); in DNS, labeling budget B = 200,000, rare class fraction α = 0.776% (corresponding to T = 10); in packet classification, B = 200,000, α = 1.150% (corresponding to T = 0.055); in MNIST, B = 5,000, α = 1%; in CIFAR10, B = 10,000, α = 10%. For the first two applications, the generators and discriminators are MLPs, and the GAN loss is the Wasserstein distance (Eq. (2))... For the image datasets, we follow the popular public ACGAN implementation (Lee 2018), where the generator and discriminator are CNNs and the GAN loss is the Jensen-Shannon divergence (Eq. (1)). (Both losses are sketched after the table.)
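
For convenience, the default settings quoted in the Research Type and Experiment Setup rows can be collected in one place. Below is a minimal sketch in Python; the name DEFAULTS and the dictionary layout are illustrative, not taken from the released RareGAN code, which organizes its configuration differently.

```python
# Default experiment settings as quoted from the paper. The dictionary
# layout and the name DEFAULTS are illustrative only; see the released
# code at https://github.com/fjxmlzn/RareGAN for the actual configuration.
DEFAULTS = {
    "num_stages_S": 2,   # for RareGAN and ALCG
    "weight_w": 3,       # for RareGAN
    "dns": {"budget_B": 200_000, "alpha": 0.00776},                    # alpha = 0.776%, T = 10
    "packet_classification": {"budget_B": 200_000, "alpha": 0.01150},  # alpha = 1.150%, T = 0.055
    "mnist": {"budget_B": 5_000, "alpha": 0.01},                       # alpha = 1%
    "cifar10": {"budget_B": 10_000, "alpha": 0.10},                    # alpha = 10%
    "num_random_seeds": 5,  # all experiments are run over 5 random seeds
}
```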
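
The Open Datasets row describes simulating class imbalance by subsampling digit 0 to 1% of the MNIST training data. A minimal sketch of that subsampling step, assuming the image and label arrays are already loaded; the function name and exact procedure are illustrative and may differ from the released code.

```python
import numpy as np

def make_rare_class_dataset(x, y, rare_class=0, rare_fraction=0.01, seed=0):
    """Subsample one class so it makes up `rare_fraction` of the data.

    Sketch of the imbalance simulation described in the paper
    (rare class = digit 0 at 1% of the training data).
    """
    rng = np.random.default_rng(seed)
    common_idx = np.where(y != rare_class)[0]
    rare_idx = np.where(y == rare_class)[0]
    # Keep just enough rare samples to form `rare_fraction` of the result:
    # n_rare / (n_common + n_rare) == rare_fraction.
    n_rare = int(len(common_idx) * rare_fraction / (1 - rare_fraction))
    keep_rare = rng.choice(rare_idx, size=n_rare, replace=False)
    idx = rng.permutation(np.concatenate([common_idx, keep_rare]))
    return x[idx], y[idx]

# Example usage with Keras' MNIST loader:
# (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
# x_imb, y_imb = make_rare_class_dataset(x_train, y_train)
```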
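
The Experiment Setup row names two GAN losses: the Wasserstein distance (Eq. (2)) for the DNS and packet-classification applications, and the Jensen-Shannon divergence (Eq. (1)) for the image datasets. The sketch below gives the two discriminator-side losses in their standard textbook formulations using TensorFlow; it follows those standard definitions, not the authors' exact code.

```python
import tensorflow as tf

def jensen_shannon_d_loss(d_real_logits, d_fake_logits):
    """Standard (Jensen-Shannon) GAN discriminator loss, cf. Eq. (1)."""
    bce = tf.nn.sigmoid_cross_entropy_with_logits
    real = bce(labels=tf.ones_like(d_real_logits), logits=d_real_logits)
    fake = bce(labels=tf.zeros_like(d_fake_logits), logits=d_fake_logits)
    return tf.reduce_mean(real + fake)

def wasserstein_d_loss(d_real, d_fake):
    """WGAN critic loss, cf. Eq. (2). In practice this requires a
    Lipschitz constraint on the critic (e.g., a gradient penalty)."""
    return tf.reduce_mean(d_fake) - tf.reduce_mean(d_real)
```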