Pseudo-label Training and Model Inertia in Neural Machine Translation

Authors: Benjamin Hsu, Anna Currey, Xing Niu, Maria Nadejde, Georgiana Dinu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform experiments across 6 language pairs (LPs): English (en) ↔ German (de), Russian (ru), and Japanese (ja). We adapt the Transformer-base architecture (Vaswani et al., 2017) to 20 encoder layers and 2 decoder layers (denoted 20:2), as recommended by Domhan et al. (2020), with SSRU decoder layers for faster decoding (Kim et al., 2019). Table 2: Training data sizes and performance scores for PLT/Baseline models.
Researcher Affiliation | Industry | Benjamin Hsu, Anna Currey, Xing Niu, Maria Nadejde, and Georgiana Dinu, AWS AI Labs (benhsu@amazon.com)
Pseudocode | No | The paper describes the training process and methods verbally and through mathematical equations (e.g., Equation 1), but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using "Sockeye 3" as a toolkit, but it does not explicitly state that the authors release their own source code or provide a repository link for the described methodology.
Open Datasets | Yes | Experiments are carried out with the WMT21 dataset (Akhbardeh et al., 2021). For en↔de we use 286M parallel segments, for en↔ja we use 17.2M parallel segments, and for en↔ru we use 34M parallel segments. We trained our models on a subset of datasets from the WMT21 news task. Specifically, we used ParaCrawl v9 (Bañón et al., 2020), WikiMatrix (Schwenk et al., 2021), WikiTitles (Bojar et al., 2018), News Commentary, the UN v1.0 dataset (Ziemski et al., 2016), JParaCrawl (Morishita et al., 2020), and the Japanese-English subtitles dataset (Pryzant et al., 2018).
Dataset Splits | Yes | For development, we use WMT newstest datasets from earlier years as our development set (see Appendix B for more details on the datasets used).
Hardware Specification | No | The paper states "Training is done on 8 GPUs with Sockeye 3's large batch training." However, it does not specify the exact model or type of GPUs used (e.g., NVIDIA V100, A100), nor any CPU/memory details.
Software Dependencies | Yes | Training is done on 8 GPUs with Sockeye 3's large batch training.
Experiment Setup | Yes | All models in our experiments used the following hyperparameters. Training and development data was tokenized using the Sacremoses tokenizer. Words were segmented using BPE (Sennrich et al., 2016b) with 32K operations. ... 'learning_rate_scheduler_type': 'inv-sqrt-decay', 'keep_last_params': 10, 'update_interval': 16, 'transformer_model_size': (512, 512), 'transformer_postprocess': ('dr', 'dr'), 'learning_rate_warmup': 2000, 'transformer_dropout_act': (0.1, 0.1), 'transformer_feed_forward_num_hidden': (2048, 2048), 'max_num_checkpoint_not_improved': 60, 'weight_init_xavier_factor_type': 'avg', 'optimized_metric': 'perplexity', 'cache_strategy': 'best', 'num_layers': (20, 2), 'use_cpu': False, 'checkpoint_improvement_threshold': 0.001, 'device_ids': [-1], 'learning_rate_reduce_num_not_improved': 8, 'initial_learning_rate': 0.06325, 'seed': 1, 'cache_metric': 'perplexity', 'gradient_clipping_type': 'abs', 'cache_last_best_params': 8, 'weight_init_scale': 3.0, 'dtype': 'float32', 'decode_and_evaluate': 500, 'max_seconds': 1036800, 'amp': True, 'keep_initializations': True, 'transformer_dropout_prepost': (0.1, 0.1), 'transformer_attention_heads': (8, 8), 'weight_tying_type': 'src_trg_softmax', 'learning_rate_reduce_factor': 0.9, 'loss': 'cross-entropy', 'horovod': True, 'num_embed': (512, 512), 'embed_dropout': (0.0, 0.0), 'transformer_preprocess': ('n', 'n'), 'encoder': 'transformer', 'loglevel_secondary_workers': 'ERROR', 'label_smoothing': 0.1, 'batch_size': 2500, 'learning_rate_t_scale': 1.0, 'batch_type': 'max-word', 'optimizer': 'adam', 'transformer_dropout_attention': (0.1, 0.1), 'decoder': 'ssru_transformer', 'min_num_epochs': 1, 'checkpoint_interval': 500, 'transformer_positional_embedding_type': 'fixed', 'lock_dir': '/data', 'gradient_clipping_threshold': -1.0, 'weight_init': 'xavier', 'no_hybridization': False, 'batch_sentences_multiple_of': 8, 'transformer_activation_type': ('relu', 'relu')
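
The preprocessing steps quoted in the Experiment Setup row (Sacremoses tokenization, then BPE with 32K operations) can be approximated with the sacremoses and subword-nmt packages. The following is a minimal sketch under the assumption that the standard Python APIs of those tools were used; the file names (train.en, train.de, bpe.codes) and the choice of a joint BPE model over both languages are illustrative assumptions, not details confirmed by the paper.

```python
# Sketch: Moses-style tokenization followed by BPE with 32K merge operations,
# approximating the preprocessing described in the Experiment Setup row.
# File paths and the joint-BPE choice are placeholders/assumptions.
from sacremoses import MosesTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE


def tokenize_file(path_in: str, path_out: str, lang: str) -> None:
    """Tokenize one side of the parallel corpus with the Moses tokenizer."""
    tok = MosesTokenizer(lang=lang)
    with open(path_in, encoding="utf-8") as fin, open(path_out, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(tok.tokenize(line.strip(), return_str=True) + "\n")


tokenize_file("train.en", "train.tok.en", lang="en")
tokenize_file("train.de", "train.tok.de", lang="de")

# Learn BPE codes with 32K merge operations over the concatenated training data.
with open("train.tok.both", "w", encoding="utf-8") as fout:
    for path in ("train.tok.en", "train.tok.de"):
        with open(path, encoding="utf-8") as fin:
            fout.write(fin.read())

with open("train.tok.both", encoding="utf-8") as fin, open("bpe.codes", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, num_symbols=32000)

# Apply the learned codes to each side of the tokenized corpus.
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
for path in ("train.tok.en", "train.tok.de"):
    with open(path, encoding="utf-8") as fin, open(path.replace(".tok.", ".bpe."), "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(bpe.process_line(line))
```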
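The hyperparameter dictionary above appears to dump Sockeye 3 training arguments. The sketch below turns a subset of those entries back into a sockeye-train command line, assuming that each dictionary key is the argparse destination of a CLI flag (underscores mapping to dashes) and that (encoder, decoder) tuples use the toolkit's colon-separated syntax; the prepared_data and model paths are placeholders. It is a reconstruction under those assumptions, not the authors' actual launch script.

```python
# Sketch: rebuilding a sockeye-train invocation from the quoted hyperparameters.
# Assumptions: dict keys map to Sockeye 3 CLI flags (underscores -> dashes),
# tuples use the encoder:decoder colon syntax, and data/model paths are placeholders.
import subprocess

hparams = {
    "num_layers": (20, 2),                      # 20 encoder layers, 2 decoder layers (20:2)
    "decoder": "ssru_transformer",              # SSRU decoder for faster decoding
    "transformer_model_size": (512, 512),
    "transformer_attention_heads": (8, 8),
    "transformer_feed_forward_num_hidden": (2048, 2048),
    "batch_size": 2500,
    "batch_type": "max-word",
    "update_interval": 16,                      # gradient accumulation over 16 batches
    "optimizer": "adam",
    "initial_learning_rate": 0.06325,
    "learning_rate_scheduler_type": "inv-sqrt-decay",
    "learning_rate_warmup": 2000,
    "label_smoothing": 0.1,
    "weight_tying_type": "src_trg_softmax",
    "checkpoint_interval": 500,
    "max_num_checkpoint_not_improved": 60,
    "seed": 1,
}


def to_flag(key: str, value) -> list:
    """Map one dict entry to a CLI flag, using colon syntax for (encoder, decoder) pairs."""
    flag = "--" + key.replace("_", "-")
    if isinstance(value, tuple):
        return [flag, ":".join(str(v) for v in value)]
    return [flag, str(value)]


cmd = ["sockeye-train", "--prepared-data", "prepared_data", "--output", "model"]
for key, value in hparams.items():
    cmd += to_flag(key, value)

print(" ".join(cmd))                 # inspect the assembled command first
# subprocess.run(cmd, check=True)    # uncomment to actually launch training
```

Boolean switches from the dictionary (e.g., 'amp', 'horovod') are omitted here because they would need to be emitted as bare flags rather than key-value pairs.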