Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies
Authors: Paul Vicol, Luke Metz, Jascha Sohl-Dickstein
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally demonstrate the advantages of PES compared to several other methods for gradient estimation on synthetic tasks, and show its applicability to training learned optimizers and tuning hyperparameters. |
| Researcher Affiliation | Collaboration | 1University of Toronto 2Google Brain. Correspondence to: Paul Vicol <pvicol@cs.toronto.edu>. |
| Pseudocode | Yes | Algorithm 1: Truncated Evolution Strategies (ES) applied to partial unrolls of a computation graph. ... Algorithm 2: Persistent evolution strategies (PES). Differences from ES are highlighted in purple. (A minimal sketch of a PES update appears after this table.) |
| Open Source Code | Yes | A simplified code snippet implementing PES is provided in Appendix M. |
| Open Datasets | Yes | We used an LSTM with 5 hidden units and 5-dimensional embeddings, for character-level language modeling on the Penn Treebank corpus (Marcus et al., 1993)... train an MLP on CIFAR-10... We train a linear policy on the Swimmer-v2 MuJoCo environment... MNIST |
| Dataset Splits | No | The paper mentions using a 'validation set' for MNIST in Section 5.4 ('Targeting Validation Accuracy'), but it does not provide specific dataset split information (exact percentages, sample counts, or detailed methodology) for how the validation split was created or what its size is. |
| Hardware Specification | Yes | We outer-train on 8 TPUv2 cores with asynchronous, batched updates of size 16. |
| Software Dependencies | No | The paper states that 'All experiments used JAX (Bradbury et al., 2018),' but it does not provide specific version numbers for JAX or any other software libraries or dependencies used, which are required for reproducibility. |
| Experiment Setup | Yes | Algorithms 1 and 2 list σ (standard deviation of perturbations) and α (learning rate) as input parameters. ... We outer-train with Adam, using a learning rate of 10^{-4} selected via grid search over half-orders of magnitude for each method independently. We use gradient clipping of 3 applied to each gradient coordinate. ... α_t = θ_0 / (1 + t/Q)^{θ_1}, where α_t is the learning rate at step t, θ_0 is the initial learning rate, θ_1 is the decay factor, and Q is a constant fixed to 5000. This schedule is used for SGD with fixed momentum 0.9. The full unrolled inner problem consists of T = 5000 optimization steps, and we consider using vanilla ES and PES with truncation lengths K ∈ {10, 100}, yielding 500 and 50 unrolls per inner problem. (A small sketch of this schedule follows the PES sketch after the table.) |
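
The pseudocode row above refers to Algorithm 1 (truncated ES) and Algorithm 2 (PES). Below is a minimal sketch of one PES update over a length-K partial unroll, written with JAX since the paper reports using JAX. It assumes a user-supplied `unroll(state, theta, K)` that runs K inner steps and returns `(new_state, loss)`; the function names, the antithetic-sampling detail, and the particle-batch layout are illustrative assumptions rather than the authors' exact implementation (their simplified snippet is in Appendix M of the paper).

```python
import jax
import jax.numpy as jnp

def pes_gradient_step(key, unroll, states, theta, accum, n_particles, sigma, K):
    """One PES update over a length-K partial unroll (sketch).

    states: per-particle inner states, stacked along axis 0 (n_particles, ...)
    accum:  per-particle sums of all perturbations applied so far (xi in Alg. 2)
    """
    # Antithetic perturbations: pairs (+eps, -eps); assumes n_particles is even.
    half = n_particles // 2
    eps_half = sigma * jax.random.normal(key, (half,) + theta.shape)
    eps = jnp.concatenate([eps_half, -eps_half], axis=0)

    # Unroll each perturbed particle for K steps from its *persistent* state.
    new_states, losses = jax.vmap(lambda s, e: unroll(s, theta + e, K))(states, eps)

    # PES accumulates perturbations across unrolls instead of resetting them
    # each truncation, which is what removes the truncation bias of vanilla ES.
    accum = accum + eps

    # Gradient estimate: (1 / (N * sigma^2)) * sum_i accum_i * loss_i
    grad = jnp.tensordot(losses, accum, axes=1) / (n_particles * sigma**2)
    return grad, new_states, accum
```

The only structural change relative to truncated ES is that `accum` (and the inner `states`) are carried across successive unrolls of the same inner problem and reset only when the full T-step problem restarts.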
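
For the learning rate schedule quoted in the experiment-setup row, the snippet below is a direct transcription of α_t = θ_0 / (1 + t/Q)^{θ_1} with Q fixed to 5000; the function name `lr_schedule` is an assumption for illustration.

```python
def lr_schedule(t, theta_0, theta_1, Q=5000.0):
    """Inner-problem learning rate at step t: theta_0 / (1 + t/Q)**theta_1."""
    return theta_0 / (1.0 + t / Q) ** theta_1
```

At t = 0 this returns θ_0, and the rate decays polynomially in t at a speed controlled by the decay factor θ_1; in the paper's setup this schedule drives SGD with momentum fixed at 0.9 over the T = 5000-step inner problem.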