Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies
Authors: Paul Vicol, Luke Metz, Jascha Sohl-Dickstein
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally demonstrate the advantages of PES compared to several other methods for gradient estimation on synthetic tasks, and show its applicability to training learned optimizers and tuning hyperparameters. |
| Researcher Affiliation | Collaboration | 1University of Toronto 2Google Brain. Correspondence to: Paul Vicol <pvicol@cs.toronto.edu>. |
| Pseudocode | Yes | Algorithm 1: Truncated Evolution Strategies (ES) applied to partial unrolls of a computation graph. ... Algorithm 2: Persistent evolution strategies (PES). Differences from ES are highlighted in purple. (A minimal sketch of a PES update appears after this table.) |
| Open Source Code | Yes | A simplified code snippet implementing PES is provided in Appendix M. |
| Open Datasets | Yes | We used an LSTM with 5 hidden units and 5-dimensional embeddings, for character-level language modeling on the Penn Treebank corpus (Marcus et al., 1993)... train an MLP on CIFAR-10... We train a linear policy on the Swimmer-v2 MuJoCo environment... MNIST |
| Dataset Splits | No | The paper mentions using a 'validation set' for MNIST in Section 5.4 ('Targeting Validation Accuracy'), but it does not provide specific dataset split information (exact percentages, sample counts, or detailed methodology) for how the validation split was created or what its size is. |
| Hardware Specification | Yes | We outer-train on 8 TPUv2 cores with asynchronous, batched updates of size 16. |
| Software Dependencies | No | The paper states that 'All experiments used JAX (Bradbury et al., 2018),' but it does not provide specific version numbers for JAX or any other software libraries or dependencies used, which are required for reproducibility. |
| Experiment Setup | Yes | Algorithms 1 and 2 list σ (standard deviation of perturbations) and α (learning rate) as input parameters. ... We outer-train with Adam, using a learning rate of 10^{-4} selected via grid search over half-orders of magnitude for each method independently. We use gradient clipping of 3 applied to each gradient coordinate. ... α_t = θ_0 / (1 + t/Q)^{θ_1}, where α_t is the learning rate at step t, θ_0 is the initial learning rate, θ_1 is the decay factor, and Q is a constant fixed to 5000. This schedule is used for SGD with fixed momentum 0.9. The full unrolled inner problem consists of T = 5000 optimization steps, and we consider using vanilla ES and PES with truncation lengths K ∈ {10, 100}, yielding 500 and 50 unrolls per inner problem. (A small sketch of this schedule follows the PES sketch after the table.) |
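
The pseudocode row above refers to Algorithm 1 (truncated ES) and Algorithm 2 (PES). Below is a minimal sketch of one PES update over a length-K partial unroll, written with JAX since the paper reports using JAX. It assumes a user-supplied `unroll(state, theta, K)` that runs K inner steps and returns `(new_state, loss)`; the function names, the antithetic-sampling detail, and the particle-batch layout are illustrative assumptions rather than the authors' exact implementation (their simplified snippet is in Appendix M of the paper).

```python
import jax
import jax.numpy as jnp

def pes_gradient_step(key, unroll, states, theta, accum, n_particles, sigma, K):
    """One PES update over a length-K partial unroll (sketch).

    states: per-particle inner states, stacked along axis 0 (n_particles, ...)
    accum:  per-particle sums of all perturbations applied so far (xi in Alg. 2)
    """
    # Antithetic perturbations: pairs (+eps, -eps); assumes n_particles is even.
    half = n_particles // 2
    eps_half = sigma * jax.random.normal(key, (half,) + theta.shape)
    eps = jnp.concatenate([eps_half, -eps_half], axis=0)

    # Unroll each perturbed particle for K steps from its *persistent* state.
    new_states, losses = jax.vmap(lambda s, e: unroll(s, theta + e, K))(states, eps)

    # PES accumulates perturbations across unrolls instead of resetting them
    # each truncation, which is what removes the truncation bias of vanilla ES.
    accum = accum + eps

    # Gradient estimate: (1 / (N * sigma^2)) * sum_i accum_i * loss_i
    grad = jnp.tensordot(losses, accum, axes=1) / (n_particles * sigma**2)
    return grad, new_states, accum
```

The only structural change relative to truncated ES is that `accum` (and the inner `states`) are carried across successive unrolls of the same inner problem and reset only when the full T-step problem restarts.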
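
For the learning rate schedule quoted in the experiment-setup row, the snippet below is a direct transcription of α_t = θ_0 / (1 + t/Q)^{θ_1} with Q fixed to 5000; the function name `lr_schedule` is an assumption for illustration.

```python
def lr_schedule(t, theta_0, theta_1, Q=5000.0):
    """Inner-problem learning rate at step t: theta_0 / (1 + t/Q)**theta_1."""
    return theta_0 / (1.0 + t / Q) ** theta_1
```

At t = 0 this returns θ_0, and the rate decays polynomially in t at a speed controlled by the decay factor θ_1; in the paper's setup this schedule drives SGD with momentum fixed at 0.9 over the T = 5000-step inner problem.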