PALBERT: Teaching ALBERT to Ponder
Authors: Nikita Balagansky, Daniil Gavrilov
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimented with PALBERT and PRoBERTa on the GLUE Benchmark datasets (Wang et al., 2018). The ablation study showed that PALBERT produced significantly better results than the original PonderNet architecture adapted for ALBERT fine-tuning. |
| Researcher Affiliation | Industry | Nikita Balagansky, Daniil Gavrilov Tinkoff n.n.balaganskiy@tinkoff.ai, d.gavrilov@tinkoff.ai |
| Pseudocode | No | The paper describes the proposed methods in text but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code: https://github.com/tinkoff-ai/palbert |
| Open Datasets | Yes | We experimented with PALBERT and PRoBERTa on the GLUE Benchmark datasets (Wang et al., 2018). |
| Dataset Splits | Yes | For evaluation, we performed a grid hyperparameter search on an appropriate metric score on the dev split for each dataset. We trained each model 5 times with the best hyperparameters and reported the mean and std values. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running experiments. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer' and 'BERT/ALBERT/RoBERTa' but does not specify version numbers for any software dependencies like deep learning frameworks (e.g., PyTorch, TensorFlow) or Python. |
| Experiment Setup | Yes | For evaluation, we performed a grid hyperparameter search on an appropriate metric score on the dev split for each dataset. ... We used the Adam optimizer (Kingma and Ba, 2015) for all experiments, a fixed q = 0.5 on models with the Q-exit criterion, a fixed classifier dropout value of 0.1 (Srivastava et al., 2014), and λ = 0.1. ... Table 4 (hyperparameter search ranges used in all experiments): learning rate ∈ [1e-5, 2e-5, 3e-5, 5e-5]; batch size ∈ [16, 32, 128]; lambda learning rate ∈ [1e-5, 2e-5, 3e-5]; β ∈ [0.5]. |
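The grid search described in the Experiment Setup row can be enumerated directly from the Table 4 ranges. The sketch below is illustrative only: the parameter names and the `configurations` helper are our own, not from the paper's codebase.

```python
from itertools import product

# Search ranges quoted from Table 4 of the paper; beta is fixed at 0.5.
# (q = 0.5 and classifier dropout = 0.1 are also held fixed per the setup.)
GRID = {
    "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5],
    "batch_size": [16, 32, 128],
    "lambda_learning_rate": [1e-5, 2e-5, 3e-5],
    "beta": [0.5],
}

def configurations(grid):
    """Yield every hyperparameter combination in the grid as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(configurations(GRID))
# 4 learning rates x 3 batch sizes x 3 lambda learning rates x 1 beta
# = 36 configurations, each evaluated on the dev split of a GLUE task.
```

Per the paper, the best configuration per dataset is then retrained 5 times and the mean and standard deviation of the dev metric are reported.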