Hierarchical Reinforcement Learning for Integrated Recommendation

Authors: Ruobing Xie, Shaoliang Zhang, Rui Wang, Feng Xia, Leyu Lin

AAAI 2021, pp. 4521-4528

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, we conduct extensive offline and online experiments on a billion-level real-world dataset to show the effectiveness of HRL-Rec. HRL-Rec has also been deployed on WeChat Top Stories, affecting millions of users.
Researcher Affiliation | Industry | Ruobing Xie, Shaoliang Zhang, Rui Wang, Feng Xia, Leyu Lin. WeChat Search Application Department, Tencent, China. ruobingxie@tencent.com
Pseudocode | No | The paper describes its methods in prose and with diagrams (e.g., Fig. 2), but does not include any explicit pseudocode blocks or algorithms.
Open Source Code | Yes | The paper states: 'The source codes are released in https://github.com/modriczhang/HRL-Rec.'
Open Datasets | No | The paper states: 'we build a new dataset IRec-4B from a real-world integrated recommendation system named WeChat Top Stories' and 'All data are preprocessed via data masking to protect user privacy.' However, it does not provide any concrete access information (link, DOI, repository, or citation) for this dataset, and the privacy masking suggests it is not publicly available.
Dataset Splits | No | The paper states: 'We split these instances into a train set and a test set using the chronological order.' It mentions a train set and a test set, but does not describe a separate validation split or how one was used. (A sketch of such a chronological split appears after this table.)
Hardware Specification | No | The paper does not specify the hardware used for the experiments, such as particular GPU models, CPU types, or cloud computing instances.
Software Dependencies | No | The paper mentions using 'Adam for optimization' but does not specify version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup | Yes | HRL-Rec takes the top 200 items in each channel as inputs and outputs the top 10 heterogeneous items. The maximum input sequence length is 50 for both agents. The dimensions of the aggregated feature embeddings and the item/channel embeddings are 128 and 32, respectively. The model uses 4-head self-attention, and the discount factor is set to γ = 0.3. Training uses Adam with a batch size of 256, and parameters are selected via grid search; all models share the same features and experimental settings. The loss weights are set empirically to λl : λh : λc : λs = 5 : 5 : 1 : 1 according to model performance. For fast and stable convergence, the Critics are updated at a higher frequency (usually 5 to 30 times) than the Actors, with distributed learning and asynchronous gradient optimization.
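For quick reference, the hyperparameters reported in the Experiment Setup row can be collected into a single configuration, together with a sketch of the asymmetric Critic/Actor update schedule. This is not the authors' released implementation (see the GitHub link above): only the numeric values come from the paper, all field and function names are our own, and the 10-to-1 update ratio is an assumed midpoint of the reported 5-to-30 range.

```python
# Hedged reconstruction of the reported HRL-Rec setup; values are from the
# paper, but every name here is illustrative rather than from the released repo.
HRL_REC_CONFIG = {
    "candidates_per_channel": 200,      # top items taken from each channel as input
    "output_list_length": 10,           # heterogeneous items in the final list
    "max_sequence_length": 50,          # input sequence cap for both agents
    "feature_embedding_dim": 128,       # aggregated feature embeddings
    "item_channel_embedding_dim": 32,   # item/channel embeddings
    "attention_heads": 4,               # multi-head self-attention
    "discount_factor": 0.3,             # gamma
    "optimizer": "adam",
    "batch_size": 256,
    # loss weights lambda_l : lambda_h : lambda_c : lambda_s = 5 : 5 : 1 : 1
    "loss_weights": {"lambda_l": 5.0, "lambda_h": 5.0, "lambda_c": 1.0, "lambda_s": 1.0},
    # Critics update 5-30x more often than Actors; 10 is an assumed midpoint.
    "critic_updates_per_actor_update": 10,
}

def update_critics() -> None:
    """Placeholder for one Critic gradient step."""

def update_actors() -> None:
    """Placeholder for one Actor gradient step."""

def training_step(step: int, cfg: dict = HRL_REC_CONFIG) -> None:
    # Critics are refreshed every step; Actors only every k-th step, matching
    # the paper's note that Critics update at a higher frequency than Actors.
    update_critics()
    if step % cfg["critic_updates_per_actor_update"] == 0:
        update_actors()
```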
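As referenced in the Dataset Splits row, the paper's chronological train/test split can be sketched as below. The paper gives neither a split ratio nor a column schema, so the "timestamp" field and the 90/10 ratio here are purely illustrative assumptions.

```python
# Minimal sketch of a chronological train/test split, assuming a pandas frame
# with a "timestamp" column and an illustrative 90/10 ratio (not from the paper).
import pandas as pd

def chronological_split(df: pd.DataFrame, test_frac: float = 0.1):
    """Sort interactions by time and hold out the most recent slice for testing."""
    df = df.sort_values("timestamp")           # oldest interactions first
    cutoff = int(len(df) * (1.0 - test_frac))  # first index of the test slice
    return df.iloc[:cutoff], df.iloc[cutoff:]

# Toy usage:
log = pd.DataFrame({
    "user_id":   [1, 2, 1, 3],
    "item_id":   [10, 11, 12, 13],
    "timestamp": [100, 200, 300, 400],
})
train, test = chronological_split(log)  # 3 train rows, 1 test row
```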