Virtual-Taobao: Virtualizing Real-World Online Retail Environment for Reinforcement Learning

Authors: Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, An-Xiang Zeng (pp. 4902-4909)

Venue: AAAI 2019

Reproducibility Variable | Result | LLM Response

Research Type | Experimental
  "In experiments, Virtual-Taobao is trained from hundreds of millions of real Taobao customers records. Compared with the real Taobao, Virtual-Taobao faithfully recovers important properties of the real environment. We further show that the policies trained purely in Virtual-Taobao, which has zero physical sampling cost, can have significantly superior real-world performance to the traditional supervised approaches, through online A/B tests."

Researcher Affiliation | Collaboration
  Jing-Cheng Shi (1, 2), Yang Yu (1), Qing Da (2), Shi-Yong Chen (1), An-Xiang Zeng (2)
  1. National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China ({shijc, yuy, chensy}@lamda.nju.edu.cn)
  2. Alibaba Group ({jingcheng.sjc, daqing.dq}@alibaba-inc.com, renzhong@taobao.com)

Pseudocode | Yes
  The paper provides "Algorithm 1 GAN-SD". A hedged sketch of the underlying GAN-style distribution-matching idea appears after this table.

Open Source Code | No
  No explicit statement or link providing access to source code for the described methodology was found.

Open Datasets | No
  The evidence ("In experiments, Virtual-Taobao is trained from hundreds of millions of real Taobao customers records.") shows the training data comes from real Taobao customer records; no public release of these records is stated.

Dataset Splits | No
  No explicit train/validation/test dataset splits (percentages, counts, or references to predefined standard splits) were found.

Hardware Specification | No
  No specific hardware details (e.g., GPU/CPU models, memory specifications) used for running the experiments were mentioned.

Software Dependencies | No
  No specific software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8) were mentioned.

Experiment Setup | Yes
  "We then retrain the TRPO agent with the ANC strategy in which ρ = 1 and µ = 0.01, and R2P is decreased to 0.115 in Virtual-Taobao which is more acceptable." A hedged sketch of this kind of action-norm reward shaping appears after this table.
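
The Pseudocode row points to Algorithm 1 GAN-SD, the paper's GAN-based method for generating virtual customers whose feature distribution matches the real one. For orientation only, here is a minimal PyTorch sketch of ordinary adversarial distribution matching; every dimension, layer size, and training constant is an invented placeholder, and GAN-SD itself adds distribution-level constraints beyond this vanilla objective that are not reproduced here.

    # Minimal vanilla-GAN sketch of distribution matching (NOT the paper's
    # exact GAN-SD; all sizes and constants are invented placeholders).
    import torch
    import torch.nn as nn

    FEATURE_DIM, NOISE_DIM, BATCH = 16, 8, 128  # assumed dimensions

    generator = nn.Sequential(
        nn.Linear(NOISE_DIM, 64), nn.ReLU(),
        nn.Linear(64, FEATURE_DIM), nn.Sigmoid(),  # features scaled to [0, 1]
    )
    discriminator = nn.Sequential(
        nn.Linear(FEATURE_DIM, 64), nn.ReLU(),
        nn.Linear(64, 1),  # logit: real vs. generated
    )

    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    real_records = torch.rand(4096, FEATURE_DIM)  # stand-in for real customer features

    for step in range(500):
        real = real_records[torch.randint(0, len(real_records), (BATCH,))]
        fake = generator(torch.randn(BATCH, NOISE_DIM))

        # Discriminator update: label real samples 1, generated samples 0.
        d_loss = (bce(discriminator(real), torch.ones(BATCH, 1))
                  + bce(discriminator(fake.detach()), torch.zeros(BATCH, 1)))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        # Generator update: produce samples the discriminator labels as real.
        g_loss = bce(discriminator(fake), torch.ones(BATCH, 1))
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()

At convergence the generator's samples are (ideally) indistinguishable from the real records, which is the sense in which the paper's virtual customers "faithfully recover" the real distribution.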
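
The Experiment Setup row quotes two hyperparameters of the ANC (action norm constraint) strategy, ρ = 1 and µ = 0.01, used when retraining the TRPO agent. The extracted evidence does not give the exact ANC formula, so the sketch below shows one plausible reading, assuming a shaped reward of the form r' = ρ·r - µ·‖a‖: the environment reward is scaled by ρ and penalized by the norm of the action. The function name anc_reward and the action dimensionality are hypothetical.

    # One plausible reading of ANC-style reward shaping; the paper's exact
    # formula may differ. The rho and mu defaults follow the quoted evidence.
    import numpy as np

    def anc_reward(env_reward: float, action: np.ndarray,
                   rho: float = 1.0, mu: float = 0.01) -> float:
        # Scale the environment reward by rho and subtract a penalty
        # proportional to the action's norm, discouraging extreme actions
        # that exploit inaccuracies of the learned simulator.
        return rho * env_reward - mu * float(np.linalg.norm(action))

    # Example: a zero action keeps the raw reward, while a large-norm action
    # is penalized. (The action dimension of 8 is invented.)
    print(anc_reward(1.0, np.zeros(8)))      # 1.0
    print(anc_reward(1.0, np.full(8, 3.0)))  # 1.0 - 0.01 * ||a|| ~= 0.915

In training, the TRPO agent would receive this shaped reward from the virtual environment in place of the raw reward; per the quoted evidence, retraining with ANC brought R2P down to 0.115, which the authors deem more acceptable.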