On the Adversarial Robustness of Mixture of Experts
Authors: Joan Puigcerver, Rodolphe Jenatton, Carlos Riquelme, Pranjal Awasthi, Srinadh Bhojanapalli
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We next empirically evaluate the robustness of MoEs on ImageNet using adversarial attacks and show they are indeed more robust than dense models with the same computational cost. We make key observations showing the robustness of MoEs to the choice of experts, highlighting the redundancy of experts in models trained in practice. |
| Researcher Affiliation | Industry | Joan Puigcerver (Google Research), Rodolphe Jenatton (Google Research), Carlos Riquelme (Google Research), Pranjal Awasthi (Google Research), Srinadh Bhojanapalli (Google Research) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | We pre-train our models on the private dataset JFT-300M [24] for 7 epochs... After pre-training, the models are fine-tuned on ImageNet [6], at a resolution of 384×384 pixels... |
| Dataset Splits | No | The paper mentions using JFT-300M and ImageNet datasets but does not provide specific details on training/validation/test splits, such as percentages, sample counts, or explicit splitting methodology for validation. |
| Hardware Specification | No | No specific hardware details (such as exact GPU/CPU models, processor types, or memory amounts) used for running the experiments are provided in the paper. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers. |
| Experiment Setup | Yes | We pre-train our models on the private dataset JFT-300M [24] for 7 epochs (517,859 steps with a batch size of 4,096 images), using an image resolution of 224×224 pixels, and standard data augmentation (inception crop and horizontal flips). ... In both cases we use Adam (β1 = 0.9, β2 = 0.999), with a peak learning rate of 8 × 10⁻⁴, reached after a linear warm-up of 10⁴ steps and then linearly decayed to a final value of 10⁻⁵. Weight decay of 0.1 was used on all parameters. ... After pre-training, the models are fine-tuned on ImageNet [6], at a resolution of 384×384 pixels and the same data augmentations as before, for a total of 10⁴ steps, using a batch size of 4,096 images. SGD with Momentum (µ = 0.9) is used for fine-tuning, with a peak learning rate of 0.03, reached after a linear warm-up of 500 steps, and followed with cosine decay to a final value of 10⁻⁵. The norm of the flattened vector of gradients is clipped to a maximum value of 10. |
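The Experiment Setup row quotes two distinct optimization recipes (Adam with linear warm-up and linear decay for JFT-300M pre-training; SGD with momentum, warm-up, cosine decay, and gradient clipping for ImageNet fine-tuning). The following is a minimal sketch of those hyperparameters expressed with optax; the choice of optax is an assumption, since the paper does not state which framework or optimizer library was used, and the AdamW-style decoupled weight decay below is one plausible reading of "Adam ... Weight decay of 0.1 on all parameters".

```python
# Sketch of the two training recipes quoted above (hypothetical optax setup).
import optax

# Pre-training on JFT-300M: Adam (β1=0.9, β2=0.999), linear warm-up to 8e-4
# over 1e4 steps, then linear decay to 1e-5; weight decay 0.1 on all parameters.
pretrain_steps = 517_859
pretrain_lr = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, 8e-4, transition_steps=10_000),
        optax.linear_schedule(8e-4, 1e-5, transition_steps=pretrain_steps - 10_000),
    ],
    boundaries=[10_000],
)
pretrain_tx = optax.adamw(learning_rate=pretrain_lr, b1=0.9, b2=0.999, weight_decay=0.1)

# Fine-tuning on ImageNet: SGD with momentum 0.9, warm-up of 500 steps to a peak
# of 0.03, cosine decay to 1e-5 over 1e4 total steps, and clipping of the global
# gradient norm at 10.
finetune_lr = optax.warmup_cosine_decay_schedule(
    init_value=0.0, peak_value=0.03, warmup_steps=500,
    decay_steps=10_000, end_value=1e-5,
)
finetune_tx = optax.chain(
    optax.clip_by_global_norm(10.0),
    optax.sgd(learning_rate=finetune_lr, momentum=0.9),
)
```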
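The Research Type row describes the paper's core protocol: attack a dense model and a MoE of matched computational cost with the same adversarial attack and compare their accuracy under attack. The sketch below illustrates that comparison with a PGD-style attack; the excerpt does not name the specific attack used, and `dense_apply`, `moe_apply`, and the `params_*` variables are hypothetical stand-ins for the actual models.

```python
# Illustrative robustness comparison under an L-infinity PGD attack (assumed attack).
import jax
import jax.numpy as jnp
import optax  # used only for the cross-entropy loss


def pgd_attack(apply_fn, params, images, labels, eps=4 / 255, step_size=1 / 255, steps=10):
    """Projected gradient ascent on the loss, within an L-infinity ball of radius eps."""
    def loss_fn(x):
        logits = apply_fn(params, x)
        return optax.softmax_cross_entropy_with_integer_labels(logits, labels).mean()

    adv = images
    for _ in range(steps):
        grads = jax.grad(loss_fn)(adv)
        adv = adv + step_size * jnp.sign(grads)          # ascent step on the loss
        adv = jnp.clip(adv, images - eps, images + eps)  # project back into the eps-ball
        adv = jnp.clip(adv, 0.0, 1.0)                    # keep valid pixel range
    return adv


def accuracy_under_attack(apply_fn, params, images, labels, **attack_kwargs):
    adv = pgd_attack(apply_fn, params, images, labels, **attack_kwargs)
    preds = jnp.argmax(apply_fn(params, adv), axis=-1)
    return jnp.mean(preds == labels)


# Compare a dense model and a FLOP-matched MoE on the same adversarial inputs:
# acc_dense = accuracy_under_attack(dense_apply, params_dense, x, y)
# acc_moe   = accuracy_under_attack(moe_apply, params_moe, x, y)
```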