Large scale distributed neural network training through online distillation

Authors: Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, Geoffrey E. Hinton

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We support our claims using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the largest to-date dataset used for neural language modeling, containing 6 × 10^11 tokens and based on the Common Crawl repository of web data.
Researcher Affiliation | Industry | Rohan Anil, Google, rohananil@google.com; Gabriel Pereyra, Google DeepMind, pereyra@google.com; Alexandre Passos, Google Brain, apassos@google.com; Robert Ormandi, Google, ormandi@google.com; George E. Dahl, Google Brain, gdahl@google.com; Geoffrey E. Hinton, Google Brain, geoffhinton@google.com
Pseudocode | Yes | Algorithm 1 presents the codistillation algorithm. (A hedged sketch of the codistillation update appears after the table.)
Open Source Code | No | The paper states 'We plan to release the list of document ids that remained after filtering as well as code for our invertible tokenizer for others to use this data set.' This refers to data preprocessing tools, not the open-source code for the codistillation methodology itself. No explicit statement about releasing the main research code was found.
Open Datasets | Yes | We support our claims using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the largest to-date dataset used for neural language modeling, containing 6 × 10^11 tokens and based on the Common Crawl repository of web data. [...] Common Crawl is an open repository of web crawl data. We downloaded the WET files [...] ImageNet is the most popular image classification benchmark of recent years. [...] Criteo Display Ad Challenge dataset (Criteo) is a benchmark dataset for predicting click through rate for online advertising.
Dataset Splits | Yes | Figure 1a plots the validation error as a function of global steps for the different numbers of workers we tried, using the best learning rate for each number of workers. [...] Figure 1b plots validation error against wall time for the same varying numbers of synchronous workers.
Hardware Specification | No | The paper mentions the number of 'GPU workers' used (e.g., '100 GPU workers', '128 GPUs', '256 GPUs') and discusses infrastructure limitations, but it does not specify the exact model of the GPUs (e.g., NVIDIA V100, A100) or other hardware components like CPU models or memory.
Software Dependencies | No | The paper mentions using 'ADAM optimizer Kingma & Ba (2014)' and 'Adagrad optimizer', but it does not provide specific software dependencies with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9') for the overall experimental setup.
Experiment Setup | Yes | We trained language models on Common Crawl with fully synchronous SGD using a per-worker batch size of 128 and 32, 64, 128, and 256 workers. Thus the effective batch size ranged from 4096 to 32768. Generally we should expect to need to increase the learning rate as we increase the effective batch size, so for each number of workers we tried learning rates of 0.1, 0.2, and 0.4. For 32 and 64 workers, 0.1 performed best, and since none of the original three learning rates performed well for 256 workers, we also tried an additional intermediate learning rate of 0.3, which was the best performing learning rate for 256 workers. [...] We used the ADAM optimizer Kingma & Ba (2014) for all experiments on Common Crawl. [...] We used the Adagrad optimizer with learning rate of 0.001 for training for all experiments on this dataset [Criteo]. (The effective-batch-size arithmetic is made explicit in the sketch after the table.)
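
Since the table only notes that Algorithm 1 presents the codistillation procedure, here is a minimal, hedged sketch of that update: each model copy is trained on its usual loss plus a distillation term that pulls it toward the averaged predictions of the other copies. It is written in Python/PyTorch purely for illustration; the toy network, the loss weight `distill_weight`, the burn-in length, and the random data are assumptions of this sketch, not the authors' implementation, and for simplicity the teacher predictions are computed synchronously rather than from the stale checkpoints the paper exchanges between workers.

```python
# Hedged single-process sketch of the codistillation update (cf. Algorithm 1).
# Model, loss weight, burn-in length, and data below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def codistillation_step(models, optimizers, x, y,
                        distill_weight=1.0, burn_in=False):
    """One update for every model copy.

    Each copy minimizes its usual cross-entropy loss plus a distillation
    term matching the average prediction of the *other* copies, which is
    treated as a fixed teacher (computed under no_grad). During an initial
    burn-in phase the distillation term is disabled.
    """
    with torch.no_grad():
        all_logits = [m(x) for m in models]  # teacher predictions, no gradients

    for i, (model, opt) in enumerate(zip(models, optimizers)):
        logits = model(x)
        loss = F.cross_entropy(logits, y)
        if not burn_in:
            others = torch.stack(
                [all_logits[j] for j in range(len(models)) if j != i])
            teacher_probs = others.softmax(dim=-1).mean(dim=0)
            # Penalize divergence from the averaged teacher distribution.
            loss = loss + distill_weight * F.kl_div(
                F.log_softmax(logits, dim=-1), teacher_probs,
                reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()

# Toy usage: two copies of a small classifier on random data.
models = [nn.Linear(10, 3) for _ in range(2)]
optimizers = [torch.optim.Adam(m.parameters(), lr=1e-3) for m in models]
x, y = torch.randn(32, 10), torch.randint(0, 3, (32,))
for step in range(100):
    codistillation_step(models, optimizers, x, y, burn_in=(step < 20))
```

In the setting the paper describes, each group of workers holds only its own copy of the model and reads the other copies' (possibly stale) predictions from periodically exchanged checkpoints, which is what lets codistillation keep improving training past the point where adding synchronous workers stops helping.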
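
The effective batch sizes quoted in the experiment setup follow directly from per-worker batch size times number of workers; the short snippet below just makes that arithmetic and the swept learning rates explicit (the constant names are ours, not the authors').

```python
# Effective batch size for fully synchronous SGD = per-worker batch size
# times the number of workers (figures quoted in the setup row above).
PER_WORKER_BATCH = 128
WORKER_COUNTS = [32, 64, 128, 256]
SWEPT_LEARNING_RATES = [0.1, 0.2, 0.4]  # 0.3 was added later for 256 workers

for n in WORKER_COUNTS:
    effective = PER_WORKER_BATCH * n    # 4096, 8192, 16384, 32768
    print(f"{n:>3} workers -> effective batch size {effective}")
```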