SIMPLE: A Gradient Estimator for k-Subset Sampling
Authors: Kareem Ahmed, Zhe Zeng, Mathias Niepert, Guy Van den Broeck
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show improved performance on learning to explain and sparse linear regression. We provide an algorithm for computing the exact ELBO for the k-subset distribution, obtaining significantly lower loss compared to SOTA. We conduct experiments on four different tasks: 1) A synthetic experiment designed to test the bias and variance, as well as the average deviation of SIMPLE compared to a variety of well-established estimators in the literature. 2) A discrete k-subset Variational Auto-Encoder (DVAE) setting... 3) The learning to explain (L2X) setting... 4) A novel, yet simple task, sparse linear regression... |
| Researcher Affiliation | Academia | ¹Dept. of Computer Science, UCLA; ²Dept. of Computer Science, Stuttgart University; {ahmedk,zhezeng,guyvdb}@cs.ucla.edu; mathias.niepert@simtech.uni-stuttgart.de |
| Pseudocode | Yes | Algorithm 1 PrExactly-k(θ, n, k). Input: the logits θ of the distribution, the number of variables n, and the subset size k. Output: pθ(∑ᵢ zᵢ = k). (A NumPy sketch of this computation appears after the table.) |
| Open Source Code | Yes | Our code will be made publicly available at github.com/UCLA-StarAI/SIMPLE. |
| Open Datasets | Yes | We include an experiment on the task of learning to explain (L2X) using the BEERADVOCATE dataset (McAuley et al., 2012); the DVAE is trained to minimize the sum of reconstruction loss and KL-divergence... on MNIST. |
| Dataset Splits | Yes | The training set has 80k reviews for the aspect APPEARANCE and 70k reviews for all other aspects. ... We follow Niepert et al. (2021) in computing 10 different evenly sized validation/test splits of the 10k held-out set and compute mean and standard deviation over 10 models, each trained on one split. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | The paper mentions 'Tensorflow 2.x' but does not specify exact version numbers for other key software dependencies or libraries. |
| Experiment Setup | Yes | We used the standard hyperparameter settings of Chen et al. (2018) and chose the temperature parameter t ∈ {0.1, 0.5, 1.0, 2.0} for all methods. We used the standard Adam settings and trained separate models for each aspect using MSE as the point-wise loss ℓ. As in prior work, we use a batch size of 100 and train for 100 epochs, plotting the test loss after each epoch. We use the standard Adam settings in TensorFlow 2.x and do not employ any learning-rate scheduling. The encoder network consists of an input layer with dimension 784 (we flatten the images), a dense layer with dimension 512 and ReLU activation, a dense layer with dimension 256 and ReLU activation, and a dense layer with dimension 400 (20 × 20) which outputs θ with no non-linearity. SIMPLE takes θ as input and outputs a discrete latent code of size 20 × 20. The decoder network, which takes this discrete latent code as input, consists of a dense layer with dimension 256 and ReLU activation, a dense layer with dimension 512 and ReLU activation, and finally a dense layer with dimension 784 returning the logits for the output pixels. Sigmoids are applied to these logits and the binary cross-entropy loss is computed. To obtain the best performing model of each compared method, we performed a grid search over the learning rate in {1×10⁻³, 5×10⁻⁴}, λ in {1×10⁻³, 1×10⁻², 1×10⁻¹, 1×10⁰, 1×10¹, 1×10², 1×10³}, and, for SoG I-MLE, the temperature τ in {1×10⁻¹, 1×10⁰, 1×10¹, 1×10²}. (A Keras sketch of this encoder/decoder appears after the table.) |
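The Pseudocode row quotes the signature of Algorithm 1, which computes pθ(∑ᵢ zᵢ = k) for n independent Bernoulli variables parameterized by logits θ. Below is a minimal NumPy sketch of that forward computation, assuming a standard dynamic program over prefix counts; the function name `pr_exactly_k` is illustrative, and the authors' actual implementation additionally supports exact gradients, which this sketch does not.

```python
import numpy as np

def pr_exactly_k(theta, k):
    """Sketch of a PrExactly-k forward pass: p_theta(sum_i z_i = k)
    for independent Bernoulli z_i with logits theta.

    Illustrative only; the paper's Algorithm 1 computes this quantity
    exactly as part of the differentiable SIMPLE estimator.
    """
    p = 1.0 / (1.0 + np.exp(-np.asarray(theta, dtype=np.float64)))  # sigmoid(theta)
    dp = np.zeros(k + 1)  # dp[j] = Pr(exactly j ones among variables seen so far)
    dp[0] = 1.0
    for pi in p:
        # Each new variable is 1 with prob pi (count goes up by one) or
        # 0 with prob 1 - pi (count unchanged). NumPy evaluates the full
        # right-hand side before assigning, so the old dp values are used.
        dp[1:] = dp[1:] * (1.0 - pi) + dp[:-1] * pi
        dp[0] *= 1.0 - pi
    return dp[k]

# Sanity check: with all logits 0, each p_i = 0.5, so
# Pr(exactly 2 of 4) = C(4,2) / 2^4 = 0.375.
print(pr_exactly_k(np.zeros(4), 2))  # ~0.375
```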
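And a hypothetical TensorFlow 2.x / Keras sketch of the DVAE encoder and decoder described in the Experiment Setup row. Layer sizes follow the quoted text; the SIMPLE sampling layer that sits between the two networks is omitted, since its implementation lives in the authors' repository.

```python
import tensorflow as tf

# Encoder: 784 -> 512 -> 256 -> 400 logits (theta for a 20x20 latent code).
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                  # flattened 28x28 MNIST image
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(400),                    # theta, no non-linearity
])

# SIMPLE would map theta to a discrete 20x20 latent code here (omitted).

# Decoder: 400 -> 256 -> 512 -> 784 pixel logits.
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(400,)),                  # flattened discrete latent code
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(784),                    # logits for the output pixels
])

# Sigmoid + binary cross-entropy on the decoder logits, as described above.
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
```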