RareGAN: Generating Samples for Rare Classes

Authors: Zinan Lin, Hao Liang, Giulia Fanti, Vyas Sekar

AAAI 2022, pp. 7506-7515

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that RareGAN achieves a better fidelity-diversity tradeoff on the rare class than prior work across different applications, budgets, rare class fractions, GAN losses, and architectures. Table 1 shows that RareGAN achieves better fidelity and diversity (with a smaller labeling budget) when generating DNS amplification attack packets, compared to a state-of-the-art domain-specific technique (Moon et al. 2021). Although RareGAN is primarily motivated by applications in security, networking, and systems, we also consider image generation, both as a useful tool in its own right and to visualize the improvements. Fig. 1 shows generated samples trained on a modified MNIST handwritten digit dataset (LeCun et al. 1998) where we artificially force the 0 digit to be the rare class (1% of the training data). ACGAN (Odena, Olah, and Shlens 2017), ALCG (Xie and Huang 2019), and BAGAN (Mariani et al. 2018) produce severely mode-collapsed samples. Elastic-InfoGAN (Ojha et al. 2019) produces samples from the wrong class. A standard GAN memorizes the training dataset. RareGAN (bottom) produces high-quality, diverse samples from the correct class without memorizing the training data. We conduct experiments on all three applications in Section 1. The code can be found at https://github.com/fjxmlzn/RareGAN. Unless otherwise specified, the default configurations are: the number of stages S = 2 (for RareGAN and ALCG), weight w = 3 (for RareGAN); in DNS, labeling budget B = 200,000, rare class fraction α = 0.776% (corresponding to T = 10); in packet classification, B = 200,000, α = 1.150% (corresponding to T = 0.055); in MNIST, B = 5,000, α = 1%; in CIFAR10, B = 10,000, α = 10%. All experiments are run over 5 random seeds. (These defaults are collected in a sketch after this table.)
Researcher Affiliation | Academia | Zinan Lin, Hao Liang, Giulia Fanti, Vyas Sekar (Carnegie Mellon University)
Pseudocode | No | The paper describes its methods in prose and equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code can be found at https://github.com/fjxmlzn/RareGAN.
Open Datasets | Yes | Fig. 1 shows generated samples trained on a modified MNIST handwritten digit dataset (LeCun et al. 1998) where we artificially force the 0 digit to be the rare class (1% of the training data). Following the settings of related work (Mariani et al. 2018), we simulate the imbalanced dataset with the widely used MNIST (LeCun et al. 1998) and CIFAR10 (Krizhevsky 2009) datasets. (A sketch of this imbalance simulation appears after the table.)
Dataset Splits | No | The paper mentions training and testing and uses well-known datasets, but it does not explicitly state the proportions or counts for training, validation, and test splits. It references the original train/test splits of MNIST and CIFAR10 in the appendix but does not specify a validation split.
Hardware Specification | No | The paper mentions computing platforms (CloudLab and the Bridges system at the Pittsburgh Supercomputing Center, part of XSEDE) but does not specify exact GPU/CPU models, processor types, or memory details.
Software Dependencies | No | The paper mentions specific GAN variants (ACGAN, Wasserstein GAN) and the Jensen-Shannon divergence loss, and refers to a TensorFlow implementation, but it does not provide version numbers for TensorFlow or any other software libraries or dependencies.
Experiment Setup | Yes | Unless otherwise specified, the default configurations are: the number of stages S = 2 (for RareGAN and ALCG), weight w = 3 (for RareGAN); in DNS, labeling budget B = 200,000, rare class fraction α = 0.776% (corresponding to T = 10); in packet classification, B = 200,000, α = 1.150% (corresponding to T = 0.055); in MNIST, B = 5,000, α = 1%; in CIFAR10, B = 10,000, α = 10%. For the first two applications, the generators and discriminators are MLPs, and the GAN loss is the Wasserstein distance (Eq. (2))... For the image datasets, we follow the popular public ACGAN implementation (Lee 2018), where the generator and discriminator are CNNs and the GAN loss is the Jensen-Shannon divergence (Eq. (1)). (Both losses are sketched after the table.)
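
For convenience, the default settings quoted in the Research Type and Experiment Setup rows can be collected in one place. Below is a minimal sketch in Python; the name DEFAULTS and the dictionary layout are illustrative, not taken from the released RareGAN code, which organizes its configuration differently.

```python
# Default experiment settings as quoted from the paper. The dictionary
# layout and the name DEFAULTS are illustrative only; see the released
# code at https://github.com/fjxmlzn/RareGAN for the actual configuration.
DEFAULTS = {
    "num_stages_S": 2,   # for RareGAN and ALCG
    "weight_w": 3,       # for RareGAN
    "dns": {"budget_B": 200_000, "alpha": 0.00776},                    # alpha = 0.776%, T = 10
    "packet_classification": {"budget_B": 200_000, "alpha": 0.01150},  # alpha = 1.150%, T = 0.055
    "mnist": {"budget_B": 5_000, "alpha": 0.01},                       # alpha = 1%
    "cifar10": {"budget_B": 10_000, "alpha": 0.10},                    # alpha = 10%
    "num_random_seeds": 5,  # all experiments are run over 5 random seeds
}
```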
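
The Open Datasets row describes simulating class imbalance by subsampling digit 0 to 1% of the MNIST training data. A minimal sketch of that subsampling step, assuming the image and label arrays are already loaded; the function name and exact procedure are illustrative and may differ from the released code.

```python
import numpy as np

def make_rare_class_dataset(x, y, rare_class=0, rare_fraction=0.01, seed=0):
    """Subsample one class so it makes up `rare_fraction` of the data.

    Sketch of the imbalance simulation described in the paper
    (rare class = digit 0 at 1% of the training data).
    """
    rng = np.random.default_rng(seed)
    common_idx = np.where(y != rare_class)[0]
    rare_idx = np.where(y == rare_class)[0]
    # Keep just enough rare samples to form `rare_fraction` of the result:
    # n_rare / (n_common + n_rare) == rare_fraction.
    n_rare = int(len(common_idx) * rare_fraction / (1 - rare_fraction))
    keep_rare = rng.choice(rare_idx, size=n_rare, replace=False)
    idx = rng.permutation(np.concatenate([common_idx, keep_rare]))
    return x[idx], y[idx]

# Example usage with Keras' MNIST loader:
# (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
# x_imb, y_imb = make_rare_class_dataset(x_train, y_train)
```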
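
The Experiment Setup row names two GAN losses: the Wasserstein distance (Eq. (2)) for the DNS and packet-classification applications, and the Jensen-Shannon divergence (Eq. (1)) for the image datasets. The sketch below gives the two discriminator-side losses in their standard textbook formulations using TensorFlow; it follows those standard definitions, not the authors' exact code.

```python
import tensorflow as tf

def jensen_shannon_d_loss(d_real_logits, d_fake_logits):
    """Standard (Jensen-Shannon) GAN discriminator loss, cf. Eq. (1)."""
    bce = tf.nn.sigmoid_cross_entropy_with_logits
    real = bce(labels=tf.ones_like(d_real_logits), logits=d_real_logits)
    fake = bce(labels=tf.zeros_like(d_fake_logits), logits=d_fake_logits)
    return tf.reduce_mean(real + fake)

def wasserstein_d_loss(d_real, d_fake):
    """WGAN critic loss, cf. Eq. (2). In practice this requires a
    Lipschitz constraint on the critic (e.g., a gradient penalty)."""
    return tf.reduce_mean(d_fake) - tf.reduce_mean(d_real)
```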