Frequency-aware SGD for Efficient Embedding Learning with Provable Benefits

Authors: Yan Li, Dhruv Choudhary, Xiaohan Wei, Baichuan Yuan, Bhargav Bhushanam, Tuo Zhao, Guanghui Lan

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show the proposed algorithms are able to improve or match adaptive algorithms on benchmark recommendation tasks and a large-scale industrial recommendation system, closing the performance gap between SGD and adaptive algorithms, while using significantly lower memory.
Researcher Affiliation | Collaboration | Yan Li (ISyE, Georgia Tech, yli939@gatech.edu); Dhruv Choudhary (Meta, choudharydhruv@fb.com); Xiaohan Wei (Meta, ubimeteor@fb.com); Baichuan Yuan (Meta, bcyuan@fb.com); Bhargav Bhushanam (Meta, bbhushanam@fb.com); Tuo Zhao (ISyE, Georgia Tech, tourzhao@gatech.edu); Guanghui Lan (ISyE, Georgia Tech, george.lan@isye.gatech.edu)
Pseudocode | Yes | Algorithm 1: Frequency-aware Stochastic Gradient Descent (an illustrative sketch of this style of update follows the table).
Open Source Code | No | The paper states "We build upon torchfm, which contains implementation of various popular recommendation models." in Section A.1, but it provides no link to, or statement about releasing, the authors' own implementation.
Open Datasets | Yes | Datasets: MovieLens-1M (GroupLens, 2003) and the Criteo 1TB Click Logs dataset (Criteo, 2014).
Dataset Splits | Yes | For both the MovieLens-1M and Criteo datasets, we randomly split into training, validation, and test sets, taking up 80%, 10%, and 10% of the total samples respectively (a split sketch follows the table).
Hardware Specification | No | The paper mentions training an "ultra-large industrial recommendation model" and discusses its size of "over multiple terabytes" and its "memory footprint", but it does not specify exact hardware such as GPU or CPU models, memory sizes, or specific cloud instances used for the experiments.
Software Dependencies | No | The paper states "We build upon torchfm" and implies the use of deep learning frameworks, but it does not provide version numbers for any software dependencies (e.g., the Python, PyTorch, or torchfm versions).
Experiment Setup | Yes | To ensure a fair comparison, for each dataset and model type, we carefully tune the learning rate of each algorithm for best performance. We apply early stopping and stop training whenever the validation AUC does not increase for 2 consecutive epochs, which is widely adopted in practice (Takacs et al., 2009; Dacrema et al., 2021). All algorithms use a batch size of 1024 during training. Tables 2 and 3 list learning rates for the MovieLens-1M and Criteo datasets, respectively. (An early-stopping sketch follows the table.)
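Regarding the "Pseudocode" row: the paper's Algorithm 1 specifies a frequency-aware SGD update for embedding rows. The snippet below is only an illustrative sketch of this style of update, assuming a per-row learning rate that shrinks as a row's visit count grows; the class name FreqAwareEmbeddingSGD and the 1/sqrt(count) scaling are assumptions for illustration, not the paper's exact rule.

```python
# Illustrative sketch of a frequency-aware per-row SGD update for an
# embedding table. ASSUMPTION: each row's step size is the base learning
# rate scaled by 1/sqrt(visit count). This is not the paper's exact
# Algorithm 1; see the paper for the precise frequency-dependent scaling.
import torch


class FreqAwareEmbeddingSGD:
    def __init__(self, embedding: torch.nn.Embedding, base_lr: float = 0.1):
        self.emb = embedding
        self.base_lr = base_lr
        # One visit counter per embedding row.
        self.counts = torch.zeros(embedding.num_embeddings)

    @torch.no_grad()
    def step(self, ids: torch.Tensor) -> None:
        """Update only the embedding rows touched by the current mini-batch."""
        rows = ids.unique()
        self.counts[rows] += 1  # simplification: one count per batch appearance
        # Rarely seen rows take larger steps; frequent rows take smaller ones.
        lr = self.base_lr / self.counts[rows].sqrt()
        grad = self.emb.weight.grad[rows]
        self.emb.weight[rows] -= lr.unsqueeze(1) * grad
        self.emb.weight.grad[rows] = 0.0  # clear the gradients we consumed
```

In practice, step(ids) would be called after loss.backward() with the batch's feature ids, while dense layers are updated by a standard optimizer.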
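Regarding the "Dataset Splits" row: the sketch below shows one way to produce the reported 80%/10%/10% random split with torch.utils.data.random_split. The seed and the exact splitting code used by the authors are not specified in the paper.

```python
# One possible way to realize the 80/10/10 random split described above.
# The seed value is an assumption; the paper does not state one.
import torch
from torch.utils.data import Dataset, random_split


def split_80_10_10(dataset: Dataset, seed: int = 0):
    n = len(dataset)
    n_train = int(0.8 * n)
    n_valid = int(0.1 * n)
    n_test = n - n_train - n_valid  # remainder goes to the test set
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_valid, n_test], generator=generator)
```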
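Regarding the "Experiment Setup" row: a minimal sketch of the stated early-stopping rule, stopping once validation AUC has not improved for 2 consecutive epochs. The EarlyStopper class and the helper names in the usage comment are hypothetical; they come neither from the paper nor from torchfm.

```python
# Minimal sketch of the early-stopping rule: halt once validation AUC fails
# to improve for `patience` consecutive epochs. Names here are hypothetical.
class EarlyStopper:
    def __init__(self, patience: int = 2):
        self.patience = patience
        self.best_auc = float("-inf")
        self.bad_epochs = 0

    def should_stop(self, val_auc: float) -> bool:
        """Record this epoch's validation AUC and report whether to stop."""
        if val_auc > self.best_auc:
            self.best_auc, self.bad_epochs = val_auc, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Usage inside a training loop (batch size 1024 per the paper);
# train_one_epoch and evaluate_auc are hypothetical helpers.
#
#   stopper = EarlyStopper(patience=2)
#   for epoch in range(max_epochs):
#       train_one_epoch(model, train_loader)
#       if stopper.should_stop(evaluate_auc(model, valid_loader)):
#           break
```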