Low-Variance Gradient Estimation in Unrolled Computation Graphs with ES-Single

Authors: Paul Vicol

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated ES-Single on a diverse set of tasks, from synthetic problems designed to test unbiasedness, to hyperparameter optimization, RNN training, and meta-training learned optimizers. We found that ES-Single outperformed PES across all tasks we investigated.
Researcher Affiliation | Industry | Google Brain. Correspondence to: Paul Vicol <paulvicol@google.com>.
Pseudocode | Yes | Algorithm 2: ES with a single perturbation per particle, reapplied in each truncated unroll (ES-Single). (A hedged JAX sketch of this update follows the table.)
Open Source Code | Yes | We provide JAX code for ES-Single in Appendix H, and a Colab notebook implementation here.
Open Datasets | Yes | We consider a tiny LSTM trained on the character-level Penn Treebank dataset (Marcus et al., 1993). We also revisited the UCI linear regression task used in Vicol et al. (2021), which demonstrates that truncation bias can also affect regularization hyperparameters... on the UCI Yacht dataset (Asuncion & Newman, 2007). Here, we used ES-Single to meta-learn a learning rate (LR) schedule used to train an MLP on MNIST. ...tuning the learning rate and decay factor for training an MLP on Fashion MNIST... train a ResNet on CIFAR-10.
Dataset Splits | No | The paper mentions using a 'validation set' and 'sum of validation losses' as meta-objectives, for example, 'The meta-objective is the sum of validation losses over the inner problem.' However, it does not specify exact split percentages or absolute sample counts for train/validation/test datasets.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, or cloud instances) used for running the experiments.
Software Dependencies | No | The paper mentions using 'JAX code' in Appendix H but does not specify a version number for JAX or any other software dependencies used in the experiments.
Experiment Setup | Yes | For all approaches (vanilla ES, PES, and ES-Single), we use antithetic sampling, and outer optimization uses Adam with learning rate 1e-2. The total inner problem length is T = 5000, split into 500 partial unrolls of length K = 10, with N = 1000 particles and σ = 0.1 for each estimator. (A sketch wiring these settings together follows the table.)
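
The paper's Algorithm 2 and the Appendix H code define ES-Single precisely; what follows is only a minimal sketch of the core update under simplifying assumptions. The toy inner problem `unroll`, its quadratic per-step loss, and the names `unroll` and `es_single_grad` are illustrative, not taken from the paper. The sketch shows the defining property of ES-Single: each particle's antithetic perturbation is sampled once per inner problem and reapplied at every truncated unroll, whereas vanilla truncated ES resamples per unroll.

```python
import jax
import jax.numpy as jnp

def unroll(theta, s, K):
    """Run K inner steps from state `s`; a toy stand-in for the real inner problem."""
    def step(s, _):
        s = s * (1.0 - theta)              # illustrative inner dynamics
        return s, jnp.sum(s ** 2)          # illustrative per-step loss
    s, losses = jax.lax.scan(step, s, None, length=K)
    return s, jnp.sum(losses)

def es_single_grad(theta, s_pos, s_neg, eps, K, sigma):
    """Antithetic ES-Single estimate over one length-K partial unroll.

    `eps` (shape [N, D], drawn from N(0, sigma^2 I)) is sampled ONCE at the
    start of the inner problem and reapplied here at every partial unroll.
    """
    s_pos, loss_pos = jax.vmap(lambda e, s: unroll(theta + e, s, K))(eps, s_pos)
    s_neg, loss_neg = jax.vmap(lambda e, s: unroll(theta - e, s, K))(eps, s_neg)
    # Standard antithetic ES estimator, averaged over the N particles.
    ghat = jnp.mean((loss_pos - loss_neg)[:, None] * eps, axis=0) / (2 * sigma ** 2)
    return ghat, s_pos, s_neg
```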
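
And a sketch of how the stated setup (T = 5000 split into 500 partial unrolls of K = 10, N = 1000 particles, σ = 0.1, Adam with learning rate 1e-2) might be wired to the helper above. The `optax` optimizer library and the meta-parameter dimension `D` are assumptions for illustration; the toy inner problem again stands in for the paper's actual tasks.

```python
import optax  # assumed outer-optimizer library; the Adam settings are from the paper

T, K, N, sigma = 5000, 10, 1000, 0.1          # 500 partial unrolls of length K = 10
D = 4                                          # illustrative meta-parameter dimension
theta = jnp.full((D,), 0.1)

opt = optax.adam(1e-2)                         # outer optimization: Adam, lr = 1e-2
opt_state = opt.init(theta)

key = jax.random.PRNGKey(0)
eps = sigma * jax.random.normal(key, (N, D))   # one perturbation per particle, fixed for the whole inner problem
s_pos = s_neg = jnp.ones((N, D))               # perturbed inner states, carried across unrolls

for _ in range(T // K):
    ghat, s_pos, s_neg = es_single_grad(theta, s_pos, s_neg, eps, K, sigma)
    updates, opt_state = opt.update(ghat, opt_state, theta)
    theta = optax.apply_updates(theta, updates)
```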