Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures
Authors: Julien Launay, Iacopo Poli, François Boniface, Florent Krzakala
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we challenge this perspective, and study the applicability of Direct Feedback Alignment (DFA) to neural view synthesis, recommender systems, geometric learning, and natural language processing. In contrast with previous studies limited to computer vision tasks, our findings show that it successfully trains a large range of state-of-the-art deep learning architectures, with performance close to fine-tuned backpropagation. (Section 3, Experiments) We study the applicability of DFA to a diverse set of applications requiring state-of-the-art architectures. |
| Researcher Affiliation | Collaboration | Julien Launay¹,², Iacopo Poli¹, François Boniface¹, Florent Krzakala¹,²,³. ¹LightOn; ²LPENS, École Normale Supérieure; ³IdePHICS, EPFL. {julien, iacopo, francois, florent}@lighton.ai |
| Pseudocode | No | The paper describes the forward and backward passes using mathematical equations and prose, but it does not contain structured pseudocode or algorithm blocks. (A hedged sketch of the DFA update is given below the table.) |
| Open Source Code | Yes | All code is available on the paper website at lair.lighton.ai/dfa-scales. |
| Open Datasets | Yes | We evaluate these methods on the Criteo dataset [48], which features nearly 46 million samples of one million sparse features. We evaluate performance on three citation network datasets: Cora, CiteSeer, and PubMed [65]. ... We train a Transformer to predict the next word on the WikiText-103 dataset [81], a large collection of good and featured Wikipedia articles. |
| Dataset Splits | No | The paper mentions "validation perplexity" for the Transformer experiments but does not provide specific percentages or absolute counts for training, validation, or test dataset splits for any of the experiments described. |
| Hardware Specification | No | The paper mentions using "substantial cloud compute resources, with state-of-the-art GPU hardware," but it does not provide specific models or configurations for the GPUs, CPUs, or other hardware used to run the experiments. |
| Software Dependencies | No | The paper mentions using "PyTorch Geometric [64]" and "Adam [83]" but does not provide specific version numbers for these or any other software dependencies used in the experiments. |
| Experiment Setup | Yes | Hyper-parameters fine-tuned for BP did not fare well with DFA, but changes in the optimizer narrowed the gap between BP and DFA considerably. The learning rate schedule used on top of Adam [83] in [63] proved detrimental. Using Adam alone required reducing the learning rate between BP and DFA. Increasing β2 from 0.98 [63] to 0.999 improved performance significantly. Finally, a simple scheduler that reduces the learning rate when the validation perplexity plateaus helped reduce it further. With the scheduler, the initial learning rate is 1 · 10⁻⁴ and it is multiplied by 0.2 when performance plateaus, with a patience of 1. (An illustrative rendering of this optimizer setup is sketched below the table.) |
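
For readers who want the mechanics behind the Pseudocode row, here is a minimal NumPy sketch of the DFA update rule the paper formalizes in its equations. The layer sizes, tanh activations, MSE loss, and feedback-matrix scale are illustrative assumptions, not the paper's architectures; the technique itself is only the replacement of backpropagation's transposed forward weights by fixed random feedback matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully connected network: input -> tanh -> tanh -> linear output.
# All sizes are arbitrary choices for illustration.
d_in, h1, h2, d_out = 8, 16, 16, 4
W1 = rng.normal(0.0, 0.1, (h1, d_in))
W2 = rng.normal(0.0, 0.1, (h2, h1))
W3 = rng.normal(0.0, 0.1, (d_out, h2))

# DFA: fixed random feedback matrices project the *output* error
# directly to each hidden layer, replacing the W3.T / W2.T chain
# that backpropagation would use.
B1 = rng.normal(0.0, 0.1, (h1, d_out))
B2 = rng.normal(0.0, 0.1, (h2, d_out))

def dfa_step(x, y, lr=1e-4):
    """One training step on a single example with an MSE loss."""
    global W1, W2, W3

    # Standard forward pass.
    z1 = np.tanh(W1 @ x)
    z2 = np.tanh(W2 @ z1)
    y_hat = W3 @ z2
    e = y_hat - y                      # output error dL/dy_hat for MSE

    # DFA "backward" pass: each hidden error is the global error e
    # pushed through a fixed random matrix, with no gradient flowing
    # back through the forward weights.
    d2 = (B2 @ e) * (1.0 - z2 ** 2)    # tanh'(a) = 1 - tanh(a)^2
    d1 = (B1 @ e) * (1.0 - z1 ** 2)

    # Weight updates; the output layer still receives the true error.
    W3 -= lr * np.outer(e, z2)
    W2 -= lr * np.outer(d2, z1)
    W1 -= lr * np.outer(d1, x)

# Example usage on random data.
x = rng.normal(size=d_in)
y = rng.normal(size=d_out)
dfa_step(x, y)
```

Because the feedback matrices are fixed, the per-layer error signals need no sequential backward sweep through the network, which is the source of the parallelization argument the paper makes.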
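The quoted Transformer setup also maps naturally onto a standard PyTorch optimizer plus plateau scheduler. The paper does not name a scheduler class, so reading "reduces the learning rate when the validation perplexity plateaus" as ReduceLROnPlateau is an assumption, the β1 = 0.9 default is not stated in the quote, and the Linear stand-in model is hypothetical:

```python
import torch

# Hypothetical stand-in; the paper trains a full Transformer on WikiText-103.
model = torch.nn.Linear(512, 512)

# Adam without the warmup schedule of [63]; beta2 raised from 0.98 to 0.999
# and initial learning rate 1e-4, as quoted above. beta1 = 0.9 is the
# PyTorch default and an assumption here.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

# Multiply the learning rate by 0.2 when the monitored metric (validation
# perplexity) stops improving, with a patience of 1.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.2, patience=1
)

# After each validation pass:
# scheduler.step(val_perplexity)
```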