Hierarchical Reinforcement Learning for Integrated Recommendation

Authors: Ruobing Xie, Shaoliang Zhang, Rui Wang, Feng Xia, Leyu Lin

AAAI 2021, pp. 4521-4528

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, we conduct extensive offline and online experiments on a billion-level real-world dataset to show the effectiveness of HRL-Rec. HRL-Rec has also been deployed on WeChat Top Stories, affecting millions of users.
Researcher Affiliation | Industry | Ruobing Xie, Shaoliang Zhang, Rui Wang, Feng Xia, Leyu Lin. WeChat Search Application Department, Tencent, China. ruobingxie@tencent.com
Pseudocode | No | The paper describes its methods in prose and with diagrams (e.g., Fig. 2), but does not include any explicit pseudocode blocks or algorithms.
Open Source Code | Yes | The paper states: 'The source codes are released in https://github.com/modriczhang/HRL-Rec.'
Open Datasets | No | The paper states: 'we build a new dataset IRec-4B from a real-world integrated recommendation system named WeChat Top Stories' and 'All data are preprocessed via data masking to protect user privacy.' However, it does not provide any concrete access information (link, DOI, repository, or citation) for this dataset, and the privacy masking suggests it is not publicly available.
Dataset Splits | No | The paper states: 'We split these instances into a train set and a test set using the chronological order.' It mentions a train set and a test set, but does not describe a separate validation split or how one was used. (A sketch of such a chronological split appears after this table.)
Hardware Specification | No | The paper does not specify the hardware used for the experiments, such as particular GPU models, CPU types, or cloud computing instances.
Software Dependencies | No | The paper mentions using 'Adam for optimization' but does not specify version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup | Yes | HRL-Rec takes the top 200 items in each channel as inputs and outputs the top 10 heterogeneous items. The maximum input sequence length is 50 for both agents. The dimensions of the aggregated feature embeddings and the item/channel embeddings are 128 and 32, respectively. The model uses 4-head self-attention, and the discount factor is set to γ = 0.3. Training uses Adam with a batch size of 256, and parameters are selected via grid search; all models share the same features and experimental settings. The loss weights are set empirically to λl : λh : λc : λs = 5 : 5 : 1 : 1 according to model performance. For fast and stable convergence, the Critics are updated at a higher frequency (usually 5 to 30 times) than the Actors, with distributed learning and asynchronous gradient optimization.
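For quick reference, the hyperparameters reported in the Experiment Setup row can be collected into a single configuration, together with a sketch of the asymmetric Critic/Actor update schedule. This is not the authors' released implementation (see the GitHub link above): only the numeric values come from the paper, all field and function names are our own, and the 10-to-1 update ratio is an assumed midpoint of the reported 5-to-30 range.

```python
# Hedged reconstruction of the reported HRL-Rec setup; values are from the
# paper, but every name here is illustrative rather than from the released repo.
HRL_REC_CONFIG = {
    "candidates_per_channel": 200,      # top items taken from each channel as input
    "output_list_length": 10,           # heterogeneous items in the final list
    "max_sequence_length": 50,          # input sequence cap for both agents
    "feature_embedding_dim": 128,       # aggregated feature embeddings
    "item_channel_embedding_dim": 32,   # item/channel embeddings
    "attention_heads": 4,               # multi-head self-attention
    "discount_factor": 0.3,             # gamma
    "optimizer": "adam",
    "batch_size": 256,
    # loss weights lambda_l : lambda_h : lambda_c : lambda_s = 5 : 5 : 1 : 1
    "loss_weights": {"lambda_l": 5.0, "lambda_h": 5.0, "lambda_c": 1.0, "lambda_s": 1.0},
    # Critics update 5-30x more often than Actors; 10 is an assumed midpoint.
    "critic_updates_per_actor_update": 10,
}

def update_critics() -> None:
    """Placeholder for one Critic gradient step."""

def update_actors() -> None:
    """Placeholder for one Actor gradient step."""

def training_step(step: int, cfg: dict = HRL_REC_CONFIG) -> None:
    # Critics are refreshed every step; Actors only every k-th step, matching
    # the paper's note that Critics update at a higher frequency than Actors.
    update_critics()
    if step % cfg["critic_updates_per_actor_update"] == 0:
        update_actors()
```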
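As referenced in the Dataset Splits row, the paper's chronological train/test split can be sketched as below. The paper gives neither a split ratio nor a column schema, so the "timestamp" field and the 90/10 ratio here are purely illustrative assumptions.

```python
# Minimal sketch of a chronological train/test split, assuming a pandas frame
# with a "timestamp" column and an illustrative 90/10 ratio (not from the paper).
import pandas as pd

def chronological_split(df: pd.DataFrame, test_frac: float = 0.1):
    """Sort interactions by time and hold out the most recent slice for testing."""
    df = df.sort_values("timestamp")           # oldest interactions first
    cutoff = int(len(df) * (1.0 - test_frac))  # first index of the test slice
    return df.iloc[:cutoff], df.iloc[cutoff:]

# Toy usage:
log = pd.DataFrame({
    "user_id":   [1, 2, 1, 3],
    "item_id":   [10, 11, 12, 13],
    "timestamp": [100, 200, 300, 400],
})
train, test = chronological_split(log)  # 3 train rows, 1 test row
```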