PALBERT: Teaching ALBERT to Ponder
Authors: Nikita Balagansky, Daniil Gavrilov
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimented with PALBERT and PRoBERTa on the GLUE Benchmark datasets (Wang et al., 2018). The ablation study showed that PALBERT produced significantly better results than the original PonderNet architecture adapted for ALBERT fine-tuning. |
| Researcher Affiliation | Industry | Nikita Balagansky, Daniil Gavrilov Tinkoff n.n.balaganskiy@tinkoff.ai, d.gavrilov@tinkoff.ai |
| Pseudocode | No | The paper describes the proposed methods in text but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code: https://github.com/tinkoff-ai/palbert |
| Open Datasets | Yes | We experimented with PALBERT and PRoBERTa on the GLUE Benchmark datasets (Wang et al., 2018). |
| Dataset Splits | Yes | For evaluation, we performed a grid hyperparameter search on an appropriate metric score on the dev split for each dataset. We trained each model 5 times with the best hyperparameters and reported the mean and std values. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running experiments. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer' and 'BERT/ALBERT/RoBERTa' but does not specify version numbers for any software dependencies like deep learning frameworks (e.g., PyTorch, TensorFlow) or Python. |
| Experiment Setup | Yes | For evaluation, we performed a grid hyperparameter search on an appropriate metric score on the dev split for each dataset. ... We used the Adam optimizer (Kingma and Ba, 2015) for all experiments, a fixed q = 0.5 on models with the Q-exit criterion, a fixed classifier dropout value of 0.1 (Srivastava et al., 2014), and λ = 0.1. ... Table 4 (hyperparameter search ranges used in all experiments): learning rate ∈ [1e-5, 2e-5, 3e-5, 5e-5]; batch size ∈ [16, 32, 128]; lambda learning rate ∈ [1e-5, 2e-5, 3e-5]; β ∈ [0.5]. |
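The grid search described in the Experiment Setup row can be enumerated directly from the Table 4 ranges. The sketch below is illustrative only: the parameter names and the `configurations` helper are our own, not from the paper's codebase.

```python
from itertools import product

# Search ranges quoted from Table 4 of the paper; beta is fixed at 0.5.
# (q = 0.5 and classifier dropout = 0.1 are also held fixed per the setup.)
GRID = {
    "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5],
    "batch_size": [16, 32, 128],
    "lambda_learning_rate": [1e-5, 2e-5, 3e-5],
    "beta": [0.5],
}

def configurations(grid):
    """Yield every hyperparameter combination in the grid as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(configurations(GRID))
# 4 learning rates x 3 batch sizes x 3 lambda learning rates x 1 beta
# = 36 configurations, each evaluated on the dev split of a GLUE task.
```

Per the paper, the best configuration per dataset is then retrained 5 times and the mean and standard deviation of the dev metric are reported.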