ShareBERT: Embeddings Are Capable of Learning Hidden Layers

Authors: Jia Cheng Hu, Roberto Cavicchioli, Giulia Berardinelli, Alessandro Capotondi

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that we achieve 95.5% of BERT Base performances using only 5M parameters (21.9× fewer parameters) and, most importantly, without the help of any transfer learning techniques.
Researcher Affiliation | Academia | Jia Cheng Hu, Roberto Cavicchioli, Giulia Berardinelli, Alessandro Capotondi; University of Modena and Reggio Emilia, via G. Campi 213/b, 41125 Modena, Italy; jiachenghu@unimore.it, roberto.cavicchioli@unimore.it, giulia.berardinelli@unimore.it, alessandro.capotondi@unimore.it
Pseudocode | No | The paper describes its methods verbally and mathematically but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code will be available at https://github.com/jchenghu/sharebert.
Open Datasets | Yes | We perform MLM training on the 2022 English Wikipedia and Book Corpus (Zhu et al. 2015); we use the same sub-word tokenization of BERT (Devlin et al. 2018) in the uncased instance. All models are trained for 23000 steps, batch size of 4000 and fine-tuned on GLUE tasks.
Dataset Splits | Yes | All models are trained for 23000 steps, batch size of 4000 and fine-tuned on GLUE tasks. Table 2: Performance and number-of-parameters comparison between ShareBERT variants and BERT evaluated on the GLUE dev set.
Hardware Specification | No | The paper discusses 'memory-limited devices' as the problem domain but does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory specifications) used for running the experiments.
Software Dependencies | No | The paper mentions 'FP16 mixed precision is adopted' and refers to BERT's tokenization but does not specify software dependencies such as programming languages, machine learning frameworks, or libraries with version numbers.
Experiment Setup | Yes | All models are trained for 23000 steps, batch size of 4000 and fine-tuned on GLUE tasks. FP16 mixed precision is adopted.
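The Experiment Setup row pins down only a few concrete training details: 23000 MLM steps, a batch size of 4000, FP16 mixed precision, BERT's uncased sub-word tokenization, and fine-tuning on GLUE. As a reading aid, the minimal sketch below shows what such a pre-training run could look like with the Hugging Face transformers/datasets stack. The stack choice, the small stand-in encoder, the 1% Wikipedia slice, and the 125 × 32 decomposition of the 4000-sequence batch are all assumptions, and none of this reproduces ShareBERT's embedding/hidden-layer weight sharing (see the authors' repository for that).

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Same sub-word vocabulary as BERT, uncased, as stated in the paper.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# English Wikipedia (the paper uses the 2022 dump plus BookCorpus); this
# particular Hub config and the 1% slice are illustrative assumptions.
raw = load_dataset("wikipedia", "20220301.en", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_set = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# Stand-in small BERT-style masked LM; ShareBERT's embedding/hidden-layer
# weight sharing is NOT reproduced here.
model = BertForMaskedLM(
    BertConfig(hidden_size=256, num_hidden_layers=4,
               num_attention_heads=4, intermediate_size=1024)
)

# Standard 15% MLM masking with dynamic padding.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

# The paper states 23000 steps, a batch size of 4000, and FP16 mixed precision.
# Splitting 4000 sequences into 125 per device x 32 accumulation steps is an
# assumption; the paper only gives the total.
args = TrainingArguments(
    output_dir="sharebert-mlm-sketch",
    max_steps=23_000,
    per_device_train_batch_size=125,
    gradient_accumulation_steps=32,
    fp16=True,
    logging_steps=500,
    save_steps=5_000,
)

Trainer(model=model, args=args, train_dataset=train_set,
        data_collator=collator).train()
```

Fine-tuning on the GLUE dev tasks (Table 2 of the paper) would then start from the resulting checkpoint with a per-task classification head; that step is omitted from the sketch.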