A VAE for Transformers with Nonparametric Variational Information Bottleneck

Authors: James Henderson, Fabio James Fehr

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations of a NVAE, trained on natural language text, demonstrate that NVIB can regularise the number of mixture components in the induced embedding whilst maintaining generation quality and reconstruction capacity. To support our theoretical contributions, we provide proof-of-concept experiments which demonstrate that our proposed NVIB regulariser performs as claimed.
Researcher Affiliation | Academia | James Henderson, Idiap Research Institute, Switzerland (james.henderson@idiap.ch); Fabio Fehr, Idiap Research Institute and EPFL, Switzerland (fabio.fehr@idiap.ch)
Pseudocode | No | The paper provides mathematical derivations and implementation descriptions but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | The code is available at https://github.com/idiap/nvib and https://github.com/idiap/nvib transformers.
Open Datasets | Yes | The Wikitext-2 and Wikitext-103 (Merity et al., 2017) encyclopedia datasets were selected as they are general English language corpora of a small and large scale containing high quality Wikipedia articles.
Dataset Splits | Yes | Dataset statistics can be found in Table 2. Train/Val/Test tokens: Wikitext-2 77K/8K/9K (remaining Table 2 columns: 26, 12); Wikitext-103 3578K/9K/8K (remaining columns: 25, 10). (See the loading sketch below the table.)
Hardware Specification | Yes | Each model experiment takes approximately 2hrs to run on a single NVIDIA GeForce RTX 3090. Each model experiment takes approximately 24hrs to run on a single NVIDIA Tesla V100, which was the largest compute within budget.
Software Dependencies | No | The paper mentions software such as the BERT base-uncased tokeniser, the Adam optimiser, the NLTK toolkit, and PyTorch, but does not specify their version numbers, which are required for reproducibility. (See the version-logging sketch below the table.)
Experiment Setup | Yes | We use a two layer Transformer encoder and decoder with a single attention-head. The size for the word embedding vectors and model projections are 256, feed forward dimensions 1024 ... a constant learning rate of 1e-4, Adam optimiser (Kingma & Ba, 2015), a batch size of 256, gradient norm clipping 0.1 and trained for 50 epochs (~15K steps). All combinations of the following hyperparameters were considered in a grid search for the respective models: λG = {1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 0}, λD = {10, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 0}, α = {1, 0.75, 0.5, 0.4, 0.3, 0.2, 0.1, 0}, κ = {1, 2, 5}, S = {0.9, 0.8, 0.75, 0.5, 0.25}, P = {mean, max, one}. (See the configuration sketch below the table.)
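The Dataset Splits row quotes per-split token counts for Wikitext-2 and Wikitext-103. As a reading aid, here is a minimal sketch of fetching the same train/validation/test splits with the HuggingFace datasets library; the paper does not say which loader it used, and the dataset identifiers below ("wikitext-2-raw-v1", "wikitext-103-raw-v1") are standard Hub names rather than names taken from the paper.

```python
# Minimal sketch (not the authors' code): load the Wikitext splits via the
# HuggingFace `datasets` library and report their sizes.
from datasets import load_dataset

wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1")      # train / validation / test
wikitext103 = load_dataset("wikitext", "wikitext-103-raw-v1")

for name, ds in [("Wikitext-2", wikitext2), ("Wikitext-103", wikitext103)]:
    sizes = {split: len(ds[split]) for split in ("train", "validation", "test")}
    print(name, sizes)  # number of text rows per split; token counts depend on the tokeniser
```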
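The Software Dependencies row flags missing version numbers. Below is a minimal sketch of how a reproduction attempt could at least record its own environment; the package list is an assumption based on the tools the paper names (PyTorch, NLTK, a BERT tokeniser), not a statement of what the authors installed.

```python
# Minimal sketch (assumed package list): log the versions of the libraries the
# paper mentions, so a rerun can document its environment.
import platform
import sys

import nltk
import torch

print("Python  :", sys.version.split()[0])
print("Platform:", platform.platform())
print("PyTorch :", torch.__version__)
print("NLTK    :", nltk.__version__)
```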
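The Experiment Setup row lists the fixed training settings and the hyperparameter grid. The sketch below simply collects those reported values and enumerates the grid; the variable names (lambda_g, lambda_d, alpha, kappa, s, p) and the use of itertools.product are editorial choices, not code from the authors' repositories.

```python
# Sketch of the reported configuration; values are copied from the excerpt above.
from itertools import product

train_config = {
    "encoder_layers": 2,      # two-layer Transformer encoder
    "decoder_layers": 2,      # two-layer Transformer decoder
    "attention_heads": 1,
    "embedding_dim": 256,     # word embeddings and model projections
    "feedforward_dim": 1024,
    "learning_rate": 1e-4,    # constant
    "optimizer": "Adam",
    "batch_size": 256,
    "grad_norm_clip": 0.1,
    "epochs": 50,             # ~15K steps
}

grid = {
    "lambda_g": [1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 0],
    "lambda_d": [10, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 0],
    "alpha":    [1, 0.75, 0.5, 0.4, 0.3, 0.2, 0.1, 0],
    "kappa":    [1, 2, 5],
    "s":        [0.9, 0.8, 0.75, 0.5, 0.25],
    "p":        ["mean", "max", "one"],
}

# Enumerate the full Cartesian product of the grid values.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations), "hyperparameter combinations")
```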