Pseudo-label Training and Model Inertia in Neural Machine Translation

Authors: Benjamin Hsu, Anna Currey, Xing Niu, Maria Nadejde, Georgiana Dinu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform experiments across 6 language pairs (LPs): English (en) ↔ German (de), Russian (ru), and Japanese (ja). We adapt the Transformer-base architecture (Vaswani et al., 2017) to 20 encoder layers and 2 decoder layers (denoted 20:2), as recommended by Domhan et al. (2020), with SSRU decoder layers for faster decoding (Kim et al., 2019). Table 2: Training data sizes and performance scores for PLT/Baseline models.
Researcher Affiliation | Industry | Benjamin Hsu, Anna Currey, Xing Niu, Maria Nadejde, and Georgiana Dinu, AWS AI Labs (benhsu@amazon.com)
Pseudocode | No | The paper describes the training process and methods verbally and through mathematical equations (e.g., Equation 1), but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using "Sockeye 3" as a toolkit, but it does not explicitly state that the authors release their own source code or provide a repository link for the described methodology.
Open Datasets | Yes | Experiments are carried out with the WMT21 dataset (Akhbardeh et al., 2021). For en↔de we use 286M parallel segments, for en↔ja we use 17.2M parallel segments, and for en↔ru we use 34M parallel segments. We trained our models on a subset of datasets from the WMT21 news task. Specifically, we used ParaCrawl v9 (Bañón et al., 2020), WikiMatrix (Schwenk et al., 2021), WikiTitles (Bojar et al., 2018), News Commentary, the UN v1.0 dataset (Ziemski et al., 2016), JParaCrawl (Morishita et al., 2020), and the Japanese-English subtitles dataset (Pryzant et al., 2018).
Dataset Splits | Yes | For development, we use WMT newstest datasets from earlier years as our development set (see Appendix B for more details on the datasets used).
Hardware Specification | No | The paper states "Training is done on 8 GPUs with Sockeye 3's large batch training." However, it does not specify the exact model or type of GPUs used (e.g., NVIDIA V100, A100), nor any CPU/memory details.
Software Dependencies | Yes | Training is done on 8 GPUs with Sockeye 3's large batch training.
Experiment Setup | Yes | All models in our experiments used the following hyperparameters. Training and development data was tokenized using the Sacremoses tokenizer. Words were segmented using BPE (Sennrich et al., 2016b) with 32K operations. ... 'learning_rate_scheduler_type': 'inv-sqrt-decay', 'keep_last_params': 10, 'update_interval': 16, 'transformer_model_size': (512, 512), 'transformer_postprocess': ('dr', 'dr'), 'learning_rate_warmup': 2000, 'transformer_dropout_act': (0.1, 0.1), 'transformer_feed_forward_num_hidden': (2048, 2048), 'max_num_checkpoint_not_improved': 60, 'weight_init_xavier_factor_type': 'avg', 'optimized_metric': 'perplexity', 'cache_strategy': 'best', 'num_layers': (20, 2), 'use_cpu': False, 'checkpoint_improvement_threshold': 0.001, 'device_ids': [-1], 'learning_rate_reduce_num_not_improved': 8, 'initial_learning_rate': 0.06325, 'seed': 1, 'cache_metric': 'perplexity', 'gradient_clipping_type': 'abs', 'cache_last_best_params': 8, 'weight_init_scale': 3.0, 'dtype': 'float32', 'decode_and_evaluate': 500, 'max_seconds': 1036800, 'amp': True, 'keep_initializations': True, 'transformer_dropout_prepost': (0.1, 0.1), 'transformer_attention_heads': (8, 8), 'weight_tying_type': 'src_trg_softmax', 'learning_rate_reduce_factor': 0.9, 'loss': 'cross-entropy', 'horovod': True, 'num_embed': (512, 512), 'embed_dropout': (0.0, 0.0), 'transformer_preprocess': ('n', 'n'), 'encoder': 'transformer', 'loglevel_secondary_workers': 'ERROR', 'label_smoothing': 0.1, 'batch_size': 2500, 'learning_rate_t_scale': 1.0, 'batch_type': 'max-word', 'optimizer': 'adam', 'transformer_dropout_attention': (0.1, 0.1), 'decoder': 'ssru_transformer', 'min_num_epochs': 1, 'checkpoint_interval': 500, 'transformer_positional_embedding_type': 'fixed', 'lock_dir': '/data', 'gradient_clipping_threshold': -1.0, 'weight_init': 'xavier', 'no_hybridization': False, 'batch_sentences_multiple_of': 8, 'transformer_activation_type': ('relu', 'relu')
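
The preprocessing steps quoted in the Experiment Setup row (Sacremoses tokenization, then BPE with 32K operations) can be approximated with the sacremoses and subword-nmt packages. The following is a minimal sketch under the assumption that the standard Python APIs of those tools were used; the file names (train.en, train.de, bpe.codes) and the choice of a joint BPE model over both languages are illustrative assumptions, not details confirmed by the paper.

```python
# Sketch: Moses-style tokenization followed by BPE with 32K merge operations,
# approximating the preprocessing described in the Experiment Setup row.
# File paths and the joint-BPE choice are placeholders/assumptions.
from sacremoses import MosesTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE


def tokenize_file(path_in: str, path_out: str, lang: str) -> None:
    """Tokenize one side of the parallel corpus with the Moses tokenizer."""
    tok = MosesTokenizer(lang=lang)
    with open(path_in, encoding="utf-8") as fin, open(path_out, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(tok.tokenize(line.strip(), return_str=True) + "\n")


tokenize_file("train.en", "train.tok.en", lang="en")
tokenize_file("train.de", "train.tok.de", lang="de")

# Learn BPE codes with 32K merge operations over the concatenated training data.
with open("train.tok.both", "w", encoding="utf-8") as fout:
    for path in ("train.tok.en", "train.tok.de"):
        with open(path, encoding="utf-8") as fin:
            fout.write(fin.read())

with open("train.tok.both", encoding="utf-8") as fin, open("bpe.codes", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, num_symbols=32000)

# Apply the learned codes to each side of the tokenized corpus.
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
for path in ("train.tok.en", "train.tok.de"):
    with open(path, encoding="utf-8") as fin, open(path.replace(".tok.", ".bpe."), "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(bpe.process_line(line))
```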
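The hyperparameter dictionary above appears to dump Sockeye 3 training arguments. The sketch below turns a subset of those entries back into a sockeye-train command line, assuming that each dictionary key is the argparse destination of a CLI flag (underscores mapping to dashes) and that (encoder, decoder) tuples use the toolkit's colon-separated syntax; the prepared_data and model paths are placeholders. It is a reconstruction under those assumptions, not the authors' actual launch script.

```python
# Sketch: rebuilding a sockeye-train invocation from the quoted hyperparameters.
# Assumptions: dict keys map to Sockeye 3 CLI flags (underscores -> dashes),
# tuples use the encoder:decoder colon syntax, and data/model paths are placeholders.
import subprocess

hparams = {
    "num_layers": (20, 2),                      # 20 encoder layers, 2 decoder layers (20:2)
    "decoder": "ssru_transformer",              # SSRU decoder for faster decoding
    "transformer_model_size": (512, 512),
    "transformer_attention_heads": (8, 8),
    "transformer_feed_forward_num_hidden": (2048, 2048),
    "batch_size": 2500,
    "batch_type": "max-word",
    "update_interval": 16,                      # gradient accumulation over 16 batches
    "optimizer": "adam",
    "initial_learning_rate": 0.06325,
    "learning_rate_scheduler_type": "inv-sqrt-decay",
    "learning_rate_warmup": 2000,
    "label_smoothing": 0.1,
    "weight_tying_type": "src_trg_softmax",
    "checkpoint_interval": 500,
    "max_num_checkpoint_not_improved": 60,
    "seed": 1,
}


def to_flag(key: str, value) -> list:
    """Map one dict entry to a CLI flag, using colon syntax for (encoder, decoder) pairs."""
    flag = "--" + key.replace("_", "-")
    if isinstance(value, tuple):
        return [flag, ":".join(str(v) for v in value)]
    return [flag, str(value)]


cmd = ["sockeye-train", "--prepared-data", "prepared_data", "--output", "model"]
for key, value in hparams.items():
    cmd += to_flag(key, value)

print(" ".join(cmd))                 # inspect the assembled command first
# subprocess.run(cmd, check=True)    # uncomment to actually launch training
```

Boolean switches from the dictionary (e.g., 'amp', 'horovod') are omitted here because they would need to be emitted as bare flags rather than key-value pairs.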