SIMPLE: A Gradient Estimator for k-Subset Sampling

Authors: Kareem Ahmed, Zhe Zeng, Mathias Niepert, Guy Van den Broeck

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results show improved performance on learning to explain and sparse linear regression. We provide an algorithm for computing the exact ELBO for the k-subset distribution, obtaining significantly lower loss compared to SOTA. We conduct experiments on four different tasks: 1) A synthetic experiment designed to test the bias and variance, as well as the average deviation, of SIMPLE compared to a variety of well-established estimators in the literature. 2) A discrete k-subset Variational Auto-Encoder (DVAE) setting... 3) The learning to explain (L2X) setting... 4) A novel, yet simple task, sparse linear regression...
Researcher Affiliation | Academia | ¹Dept. of Computer Science, UCLA; ²Dept. of Computer Science, Stuttgart University; {ahmedk,zhezeng,guyvdb}@cs.ucla.edu; mathias.niepert@simtech.uni-stuttgart.de
Pseudocode | Yes | Algorithm 1 PrExactly-k(θ, n, k). Input: the logits θ of the distribution, the number of variables n, and the subset size k. Output: pθ(Σᵢ zᵢ = k). (A Python sketch of this dynamic program follows the table.)
Open Source Code | Yes | Our code will be made publicly available at github.com/UCLA-StarAI/SIMPLE.
Open Datasets | Yes | We include an experiment on the task of learning to explain (L2X) using the BEERADVOCATE dataset (McAuley et al., 2012); the DVAE is trained to minimize the sum of reconstruction loss and KL-divergence... on MNIST.
Dataset Splits | Yes | The training set has 80k reviews for the aspect APPEARANCE and 70k reviews for all other aspects. ... We follow Niepert et al. (2021) in computing 10 different evenly sized validation/test splits of the 10k held-out set and compute mean and standard deviation over 10 models, each trained on one split. (An illustrative split sketch follows the table.)
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or memory specifications).
Software Dependencies | No | The paper mentions 'Tensorflow 2.x' but does not specify exact version numbers for other key software dependencies or libraries.
Experiment Setup | Yes | We used the standard hyperparameter settings of Chen et al. (2018) and choose the temperature parameter t ∈ {0.1, 0.5, 1.0, 2.0} for all methods. We used the standard Adam settings and trained separate models for each aspect using MSE as point-wise loss ℓ. As in prior work, we use a batch size of 100 and train for 100 epochs, plotting the test loss after each epoch. We use the standard Adam settings in Tensorflow 2.x, and do not employ any learning rate scheduling. The encoder network consists of an input layer with dimension 784 (we flatten the images), a dense layer with dimension 512 and ReLU activation, a dense layer with dimension 256 and ReLU activation, and a dense layer with dimension 400 (20 × 20) which outputs θ and no non-linearity. SIMPLE takes θ as input and outputs a discrete latent code of size 20 × 20. The decoder network, which takes this discrete latent code as input, consists of a dense layer with dimension 256 and ReLU activation, a dense layer with dimension 512 and ReLU activation, and finally a dense layer with dimension 784 returning the logits for the output pixels. Sigmoids are applied to these logits and the binary cross-entropy loss is computed. To obtain the best performing model of each of the compared methods, we performed a grid search over the learning rate in the range [1 × 10⁻³, 5 × 10⁻⁴], λ in the range {1 × 10⁻³, 1 × 10⁻², 1 × 10⁻¹, 1 × 10⁰, 1 × 10¹, 1 × 10², 1 × 10³}, and, for SoG I-MLE, the temperature τ in the range {1 × 10⁻¹, 1 × 10⁰, 1 × 10¹, 1 × 10²}. (A tf.keras sketch of this architecture follows the table.)
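
The Pseudocode row quotes Algorithm 1, which computes pθ(Σᵢ zᵢ = k): the probability that exactly k of n independent Bernoulli variables (with logits θ) are 1. Below is a minimal NumPy sketch of that dynamic program; the function name and the probability-space arithmetic (a real implementation would likely work in log-space for numerical stability) are our simplifications, not the authors' code.

```python
import numpy as np

def pr_exactly_k(theta, k):
    """Illustrative sketch: probability that exactly k of n independent
    Bernoulli variables are 1, given their logits theta."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(theta, dtype=float)))  # sigmoid
    # dp[j] = probability that exactly j of the variables seen so far are 1
    dp = np.zeros(k + 1)
    dp[0] = 1.0
    for pi in p:
        # each new variable is either 0 (count stays at j) or 1 (j-1 -> j);
        # the right-hand side is evaluated on the old dp before assignment
        dp[1:] = dp[1:] * (1 - pi) + dp[:-1] * pi
        dp[0] *= 1 - pi
    return dp[k]

# e.g. two fair coins (logit 0 = probability 0.5): P(exactly one head) = 0.5
assert abs(pr_exactly_k([0.0, 0.0], 1) - 0.5) < 1e-9
```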
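The Dataset Splits row follows Niepert et al. (2021) in cutting the 10k held-out set into 10 evenly sized validation/test splits. One plausible reading, sketched below under the assumption that each split is an independent shuffle halved into validation and test, is the following; the exact protocol may differ.

```python
import numpy as np

def heldout_splits(n_heldout=10_000, n_splits=10, seed=0):
    """Hypothetical reconstruction of the 10 evenly sized validation/test
    splits; not the authors' exact splitting code."""
    rng = np.random.default_rng(seed)
    half = n_heldout // 2
    splits = []
    for _ in range(n_splits):
        idx = rng.permutation(n_heldout)
        splits.append((idx[:half], idx[half:]))  # (validation, test) indices
    return splits
```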
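The Experiment Setup row specifies the DVAE encoder and decoder precisely enough to translate into tf.keras (the framework the paper names). The sketch below mirrors those layer sizes; `simple_layer` is a hypothetical placeholder for the paper's SIMPLE k-subset sampler, not an implementation of it.

```python
import tensorflow as tf

# Encoder: 784 -> 512 (ReLU) -> 256 (ReLU) -> 400 logits theta, no activation
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),           # flattened 28x28 MNIST image
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(400),             # theta for the 20 x 20 latent code
])

# Decoder: 400 -> 256 (ReLU) -> 512 (ReLU) -> 784 pixel logits
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(400,)),           # discrete 20 x 20 (= 400) code
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(784),             # sigmoid applied inside the loss
])

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def reconstruction_loss(x, simple_layer):
    """x: batch of flattened binarized images, shape (batch, 784).
    simple_layer: placeholder for SIMPLE's k-subset sampling of the code."""
    theta = encoder(x)            # unnormalized logits of the latent code
    z = simple_layer(theta)       # discrete k-subset latent sample
    return bce(x, decoder(z))     # sigmoid + binary cross-entropy
```

Training would then add the KL term quoted in the Open Datasets row and optimize with standard Adam settings, with the learning rate, λ, and (for SoG I-MLE) τ chosen by the grid search described above.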