Learning Discrete Structured Variational Auto-Encoder using Natural Evolution Strategies

Authors: Alon Berliner, Guy Rotman, Yossi Adi, Roi Reichart, Tamir Hazan

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate empirically that optimizing discrete structured VAEs using NES is as effective as gradient-based approximations. Lastly, we prove NES converges for non-Lipschitz functions as appear in discrete structured VAEs.
Researcher Affiliation | Collaboration | Alon Berliner (Technion, IIT) alon.berliner@gmail.com; Guy Rotman (Technion, IIT) rotmanguy@gmail.com; Yossi Adi (Meta AI Research) adiyoss@fb.com; Roi Reichart (Technion, IIT) roiri@technion.ac.il; Tamir Hazan (Technion, IIT) tamir.hazan@technion.ac.il
Pseudocode | Yes | Algorithm 1: Natural Evolution Strategies for discrete VAEs (see the sketch after this table)
Open Source Code | No | The paper does not provide an explicit statement or link to the open-source code for the methodology described.
Open Datasets | Yes | In our experiments, we utilize the dataset developed by Paulus et al. [48]... We consider the Universal Dependencies (UD) dataset [35, 44, 45]... The experiments were conducted on the Fashion MNIST dataset [59] with fixed binarization [51]... Experiments are conducted on the Fashion MNIST [59], KMNIST [7], and Omniglot [29] datasets with fixed binarization [51].
Dataset Splits | Yes | All reported values are measured on a test set, and the models were selected using early stopping on the validation set.
Hardware Specification | Yes | All the following experiments were conducted using an internal cluster with 4 Tesla-K80 NVIDIA GPUs.
Software Dependencies | No | The paper mentions software like the Adam optimizer [21] and Fast Text word embeddings [16], but it does not specify the library or framework versions needed to reproduce the work.
Experiment Setup | Yes | We run our experiments with the same set of parameters as in Paulus et al. [48], except that during decoding we use teacher-forcing every 3 steps instead of 9 steps. We fix NES parameters to be σ = 0.01 and N = 600... We set the hyper-parameters to those of the original implementation of Kiperwasser & Goldberg [22] and feed the models with the multilingual Fast Text word embeddings [16]. We perform a grid-search for each of the methods separately over learning rates in [5 × 10⁻⁴, 1 × 10⁻⁵] and set the mini-batch size to 128. We fix NES parameters to be σ = 0.1 and N = 400. The Adam optimizer [21] is used to optimize all methods... All models were trained using the ADAM optimizer [21] over 300 epochs with a constant learning rate of 10⁻³ and a batch size of 128.
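
The Pseudocode and Experiment Setup rows above refer to Natural Evolution Strategies with a perturbation scale σ and a population size N. As a reading aid, here is a minimal sketch of a generic antithetic NES gradient estimator in PyTorch. It is not the paper's Algorithm 1: the function name nes_gradient, the antithetic-sampling variant, and the default values (σ = 0.01, N = 600, matching one of the reported settings) are illustrative assumptions.

```python
import torch


def nes_gradient(f, theta, sigma=0.01, n_samples=600):
    """Monte-Carlo NES estimate of the gradient of E[f(theta + sigma * eps)].

    f         -- black-box objective (e.g. a discrete structured VAE loss);
                 only function evaluations are needed, no backpropagation.
    theta     -- flat tensor of encoder parameters.
    sigma     -- perturbation scale (the paper reports sigma = 0.01 or 0.1).
    n_samples -- population size N (the paper reports N = 600 or 400).
    """
    grad = torch.zeros_like(theta)
    for _ in range(n_samples // 2):
        eps = torch.randn_like(theta)
        # Antithetic (mirrored) sampling is a common variance-reduction choice;
        # it is an assumption here, not necessarily the paper's exact scheme.
        grad += (f(theta + sigma * eps) - f(theta - sigma * eps)) * eps
    return grad / (n_samples * sigma)
```

In training, such an estimate would stand in for the gradient of the non-differentiable discrete sampling step, and the resulting update could then be applied with the Adam optimizer and batch sizes listed in the Experiment Setup row.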