On the Adversarial Robustness of Mixture of Experts
Authors: Joan Puigcerver, Rodolphe Jenatton, Carlos Riquelme, Pranjal Awasthi, Srinadh Bhojanapalli
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We next empirically evaluate the robustness of MoEs on ImageNet using adversarial attacks and show they are indeed more robust than dense models with the same computational cost. We make key observations showing the robustness of MoEs to the choice of experts, highlighting the redundancy of experts in models trained in practice. |
| Researcher Affiliation | Industry | Joan Puigcerver (Google Research), Rodolphe Jenatton (Google Research), Carlos Riquelme (Google Research), Pranjal Awasthi (Google Research), Srinadh Bhojanapalli (Google Research) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | We pre-train our models on the private dataset JFT-300M [24] for 7 epochs... After pre-training, the models are fine-tuned on ImageNet [6], at a resolution of 384×384 pixels... |
| Dataset Splits | No | The paper mentions using JFT-300M and ImageNet datasets but does not provide specific details on training/validation/test splits, such as percentages, sample counts, or explicit splitting methodology for validation. |
| Hardware Specification | No | No specific hardware details (such as exact GPU/CPU models, processor types, or memory amounts) used for running the experiments are provided in the paper. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers. |
| Experiment Setup | Yes | We pre-train our models on the private dataset JFT-300M [24] for 7 epochs (517,859 steps with a batch size of 4,096 images), using an image resolution of 224×224 pixels, and standard data augmentation (inception crop and horizontal flips). ... In both cases we use Adam (β1 = 0.9, β2 = 0.999), with a peak learning rate of 8 × 10⁻⁴, reached after a linear warm-up of 10⁴ steps and then linearly decayed to a final value of 10⁻⁵. Weight decay of 0.1 was used on all parameters. ... After pre-training, the models are fine-tuned on ImageNet [6], at a resolution of 384×384 pixels and the same data augmentations as before, for a total of 10⁴ steps, using a batch size of 4,096 images. SGD with Momentum (µ = 0.9) is used for fine-tuning, with a peak learning rate of 0.03, reached after a linear warm-up of 500 steps, and followed with cosine decay to a final value of 10⁻⁵. The norm of the flattened vector of gradients is clipped to a maximum value of 10. |
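The Experiment Setup row quotes two distinct optimization recipes (Adam with linear warm-up and linear decay for JFT-300M pre-training; SGD with momentum, warm-up, cosine decay, and gradient clipping for ImageNet fine-tuning). The following is a minimal sketch of those hyperparameters expressed with optax; the choice of optax is an assumption, since the paper does not state which framework or optimizer library was used, and the AdamW-style decoupled weight decay below is one plausible reading of "Adam ... Weight decay of 0.1 on all parameters".

```python
# Sketch of the two training recipes quoted above (hypothetical optax setup).
import optax

# Pre-training on JFT-300M: Adam (β1=0.9, β2=0.999), linear warm-up to 8e-4
# over 1e4 steps, then linear decay to 1e-5; weight decay 0.1 on all parameters.
pretrain_steps = 517_859
pretrain_lr = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, 8e-4, transition_steps=10_000),
        optax.linear_schedule(8e-4, 1e-5, transition_steps=pretrain_steps - 10_000),
    ],
    boundaries=[10_000],
)
pretrain_tx = optax.adamw(learning_rate=pretrain_lr, b1=0.9, b2=0.999, weight_decay=0.1)

# Fine-tuning on ImageNet: SGD with momentum 0.9, warm-up of 500 steps to a peak
# of 0.03, cosine decay to 1e-5 over 1e4 total steps, and clipping of the global
# gradient norm at 10.
finetune_lr = optax.warmup_cosine_decay_schedule(
    init_value=0.0, peak_value=0.03, warmup_steps=500,
    decay_steps=10_000, end_value=1e-5,
)
finetune_tx = optax.chain(
    optax.clip_by_global_norm(10.0),
    optax.sgd(learning_rate=finetune_lr, momentum=0.9),
)
```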
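The Research Type row describes the paper's core protocol: attack a dense model and a MoE of matched computational cost with the same adversarial attack and compare their accuracy under attack. The sketch below illustrates that comparison with a PGD-style attack; the excerpt does not name the specific attack used, and `dense_apply`, `moe_apply`, and the `params_*` variables are hypothetical stand-ins for the actual models.

```python
# Illustrative robustness comparison under an L-infinity PGD attack (assumed attack).
import jax
import jax.numpy as jnp
import optax  # used only for the cross-entropy loss


def pgd_attack(apply_fn, params, images, labels, eps=4 / 255, step_size=1 / 255, steps=10):
    """Projected gradient ascent on the loss, within an L-infinity ball of radius eps."""
    def loss_fn(x):
        logits = apply_fn(params, x)
        return optax.softmax_cross_entropy_with_integer_labels(logits, labels).mean()

    adv = images
    for _ in range(steps):
        grads = jax.grad(loss_fn)(adv)
        adv = adv + step_size * jnp.sign(grads)          # ascent step on the loss
        adv = jnp.clip(adv, images - eps, images + eps)  # project back into the eps-ball
        adv = jnp.clip(adv, 0.0, 1.0)                    # keep valid pixel range
    return adv


def accuracy_under_attack(apply_fn, params, images, labels, **attack_kwargs):
    adv = pgd_attack(apply_fn, params, images, labels, **attack_kwargs)
    preds = jnp.argmax(apply_fn(params, adv), axis=-1)
    return jnp.mean(preds == labels)


# Compare a dense model and a FLOP-matched MoE on the same adversarial inputs:
# acc_dense = accuracy_under_attack(dense_apply, params_dense, x, y)
# acc_moe   = accuracy_under_attack(moe_apply, params_moe, x, y)
```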