Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A VAE for Transformers with Nonparametric Variational Information Bottleneck
Authors: James Henderson, Fabio James Fehr
ICLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations of a NVAE, trained on natural language text, demonstrate that NVIB can regularise the number of mixture components in the induced embedding whilst maintaining generation quality and reconstruction capacity. and To support our theoretical contributions, we provide proof-of-concept experiments which demonstrate that our proposed NVIB regulariser performs as claimed. |
| Researcher Affiliation | Academia | James Henderson, Idiap Research Institute, Switzerland; Fabio Fehr, Idiap Research Institute and EPFL, Switzerland |
| Pseudocode | No | The paper provides mathematical derivations and implementation descriptions but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The code is available at https://github.com/idiap/nvib and https://github.com/idiap/nvib-transformers. |
| Open Datasets | Yes | The Wikitext-2 and Wikitext-103 (Merity et al., 2017) encyclopedia datasets were selected as they are general English language corpora of a small and large scale containing high quality Wikipedia articles. |
| Dataset Splits | Yes | Dataset statistics can be found in Table 2. Train/Val/Test tokens: Wikitext-2 77K/8K/9K; Wikitext-103 3578K/9K/8K |
| Hardware Specification | Yes | Each model experiment takes approximately 2hrs to run on a single NVIDIA GeForce RTX 3090. and Each model experiment takes approximately 24hrs to run on a single NVIDIA Tesla V100, which was the largest compute within budget. |
| Software Dependencies | No | The paper mentions software like 'BERT base-uncased tokeniser', 'Adam optimiser', 'NLTK toolkit', 'BERT tokeniser', and 'PyTorch' but does not specify their version numbers, which is required for reproducibility. |
| Experiment Setup | Yes | We use a two layer Transformer encoder and decoder with a single attention-head. The size for the word embedding vectors and model projections are 256, feed forward dimensions 1024...a constant learning rate of 1e-4, Adam optimiser (Kingma & Ba, 2015), a batch size of 256, gradient norm clipping 0.1 and trained for 50 epochs (~15K steps). and All combinations of the following hyperparameters were considered in a grid search for the respective models: λ_G = {1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 0}; λ_D = {10, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 0}; α = {1, 0.75, 0.5, 0.4, 0.3, 0.2, 0.1, 0}; κ = {1, 2, 5}; S = {0.9, 0.8, 0.75, 0.5, 0.25}; P = {mean, max, one} |
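The reported hyperparameter grid can be enumerated mechanically. The sketch below is illustrative only, assuming the full Cartesian product of the values quoted above; the variable names (`lambda_g`, `pooling`, etc.) are our own stand-ins for the paper's λ_G, λ_D, α, κ, S, and P, and in practice each model variant would likely search only the subset of hyperparameters relevant to it.

```python
from itertools import product

# Hyperparameter values as quoted in the paper's experiment setup.
# Key names are illustrative stand-ins, not the paper's notation.
grid = {
    "lambda_g": [1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 0],
    "lambda_d": [10, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 0],
    "alpha":    [1, 0.75, 0.5, 0.4, 0.3, 0.2, 0.1, 0],
    "kappa":    [1, 2, 5],
    "s":        [0.9, 0.8, 0.75, 0.5, 0.25],
    "pooling":  ["mean", "max", "one"],
}

def configs(grid):
    """Yield one config dict per point in the full Cartesian product."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

all_configs = list(configs(grid))
print(len(all_configs))  # 7 * 8 * 8 * 3 * 5 * 3 = 20160 combinations
```

Enumerating the product like this makes the cost of an exhaustive search explicit (20,160 runs for the full grid), which is consistent with the paper's note that compute budget constrained experiments.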