EM-Network: Oracle Guided Self-distillation for Sequence Learning

Authors: Ji Won Yoon, Sunghwan Ahn, Hyeonseung Lee, Minchan Kim, Seok Min Kim, Nam Soo Kim

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct comprehensive experiments on two types of seq2seq models: connectionist temporal classification (CTC) for speech recognition and attention-based encoder-decoder (AED) for machine translation. Experimental results demonstrate that the EM-Network significantly advances the current state-of-the-art approaches, improving over the best prior work on speech recognition and establishing state-of-the-art performance on WMT14 and IWSLT14. (See the CTC sketch after this table.)
Researcher Affiliation | Academia | Department of ECE and INMC, Seoul National University, Seoul, Republic of Korea. Correspondence to: Ji Won Yoon <jwyoon@hi.snu.ac.kr>, Nam Soo Kim <nkim@snu.ac.kr>.
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Source code can be found at https://github.com/Yoon-Project/EM-Network-official.
Open Datasets | Yes | We trained the EM-Network with the full 960 hours of LibriSpeech (LS-960) (Panayotov et al., 2015)... We evaluated the EM-Network on the IWSLT14 and WMT14 datasets for English-to-German (En-De) and German-to-English (De-En) translation tasks. (See the LibriSpeech loading sketch after this table.)
Dataset Splits | Yes | Regarding the selection of the best checkpoint, we selected the best checkpoint based on the validation loss, following BiBERT (Xu et al., 2021). For the IWSLT14 dataset, we used four Titan V GPUs (each with 12GB of memory)... In the case of the WMT14 dataset, newstest2012 and newstest2013 were combined as the validation set, and we used newstest2014 as the test set. (See the validation-set sketch after this table.)
Hardware Specification | Yes | For the fully-supervised setting, we used four Quadro RTX 8000 GPUs (each with 48GB of memory)... For the IWSLT14 dataset, we used four Titan V GPUs (each with 12GB of memory)... We utilized four Quadro RTX 8000 GPUs (each with 48GB of memory)...
Software Dependencies | No | The paper mentions several toolkits and optimizers (e.g., NeMo, fairseq, ESPnet, AdamW, Adam) but does not provide specific version numbers for any of them. (See the version-check sketch after this table.)
Experiment Setup | Yes | The tunable parameter α was experimentally set to 2. The AdamW algorithm (Loshchilov & Hutter, 2019) was employed as the optimizer with an initial learning rate of 5.0. The numbers of frequency and time masks were set to 2 and 5, respectively, and the widths of the frequency and time masks were set to 27 and 0.05. During fine-tuning, the parameter α and the masking ratio λ were set to 2 and 50%, respectively. (See the configuration sketch after this table.)
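
The Research Type row above refers to CTC-based speech recognition. As a rough, self-contained illustration of the CTC objective only (not the paper's EM-Network model or training code), here is a toy PyTorch sketch with made-up tensor shapes:

```python
import torch
import torch.nn as nn

# Toy dimensions; the paper's actual acoustic encoder is not reproduced here.
time_steps, batch, vocab, target_len = 100, 4, 32, 20

# Log-probabilities over the vocabulary, shaped (T, N, C) as nn.CTCLoss expects.
log_probs = torch.randn(time_steps, batch, vocab).log_softmax(dim=-1)

# Random integer targets (index 0 is reserved for the CTC blank symbol).
targets = torch.randint(1, vocab, (batch, target_len), dtype=torch.long)
input_lengths = torch.full((batch,), time_steps, dtype=torch.long)
target_lengths = torch.full((batch,), target_len, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(f"CTC loss: {loss.item():.3f}")
```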
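
For the Open Datasets row, the quoted "full 960 hours" of LibriSpeech corresponds to the union of the three official training subsets. A minimal loading sketch using torchaudio (the data root and download flag are illustrative, not taken from the paper):

```python
import torchaudio

# LS-960 = train-clean-100 + train-clean-360 + train-other-500.
subsets = ["train-clean-100", "train-clean-360", "train-other-500"]
datasets = [
    torchaudio.datasets.LIBRISPEECH(root="./data", url=subset, download=True)
    for subset in subsets
]

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, *_ = datasets[0][0]
print(sample_rate, transcript[:60])
```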
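
For the Dataset Splits row, the WMT14 validation set is described as the concatenation of newstest2012 and newstest2013. A sketch of that step, where the file names are assumptions following common WMT preprocessing conventions rather than the paper's released scripts:

```python
from pathlib import Path

def concat_files(sources, destination):
    """Concatenate plain-text sets, e.g. newstest2012 + newstest2013 -> valid."""
    with open(destination, "w", encoding="utf-8") as out:
        for src in sources:
            out.write(Path(src).read_text(encoding="utf-8"))

# Hypothetical file names for the En-De direction.
concat_files(["newstest2012.en", "newstest2013.en"], "valid.en")
concat_files(["newstest2012.de", "newstest2013.de"], "valid.de")
# newstest2014.{en,de} would then serve as the test set.
```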
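
Since the Software Dependencies row notes that no version numbers are given, reproducing the environment requires recording them locally. A small version-check sketch (the PyPI package names for the toolkits mentioned are assumptions; adjust for the local environment):

```python
import importlib.metadata as md

# nemo_toolkit, fairseq, and espnet are the usual PyPI names for NeMo,
# fairseq, and ESPnet.
for pkg in ["nemo_toolkit", "fairseq", "espnet", "torch"]:
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```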
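
Finally, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The key names are hypothetical; only the values come from the quoted text, and the interpretation of the 5.0 learning rate (typically rescaled by a warmup/decay schedule) and of the 0.05 time-mask width are assumptions:

```python
# Hyperparameters from the Experiment Setup row, organized as a plain dict.
config = {
    "alpha": 2,                    # tunable loss weight α
    "optimizer": "AdamW",          # Loshchilov & Hutter, 2019
    "initial_lr": 5.0,             # presumably scaled by a warmup/decay schedule (assumption)
    "spec_augment": {
        "num_freq_masks": 2,
        "num_time_masks": 5,
        "freq_mask_width": 27,     # frequency bins
        "time_mask_width": 0.05,   # likely a fraction of utterance length (assumption)
    },
    "finetune": {
        "alpha": 2,
        "mask_ratio": 0.50,        # masking ratio λ = 50%
    },
}
```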