Policy Learning for Balancing Short-Term and Long-Term Rewards

Authors: Peng Wu, Ziyu Shen, Feng Xie, Zhongyao Wang, Chunchen Liu, Yan Zeng

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted to validate the effectiveness of the proposed method, demonstrating its practical applicability.
Researcher Affiliation | Collaboration | ¹School of Mathematics and Statistics, Beijing Technology and Business University; ²Ling Yang, Alibaba Group, Hangzhou, China.
Pseudocode | No | The paper describes its method in prose, outlining steps for policy evaluation and learning, and provides mathematical formulations. However, it does not include a formalized pseudocode block or an algorithm figure.
Open Source Code | Yes | Please note that our code is available at https://github.com/YanaZeng/Short_long_term-Rewards.
Open Datasets | Yes | We perform extensive experiments on three widely used benchmark datasets, IHDP (Hill, 2011), JOBS (LaLonde, 1986), and PRODUCT (Gao et al., 2022).
Dataset Splits | No | The paper mentions running '50 independent trials' and describes how data is simulated, but it does not specify explicit training/validation/test dataset splits (e.g., percentages or sample counts for each partition).
Hardware Specification | No | The paper describes its experimental setup but does not provide any specific details about the hardware (e.g., GPU model, CPU type, memory) used to conduct the experiments.
Software Dependencies | No | The paper mentions the use of 'machine learning methods' and the 'Adam optimization method' but does not specify any software names with version numbers for libraries, frameworks, or programming languages (e.g., Python version, PyTorch/TensorFlow version).
Experiment Setup | Yes | We simulate the potential short-term outcomes as follows: $S_i(0) \sim \mathrm{Bern}(\sigma(w_0 X_i + \epsilon_{0,i}))$, $S_i(1) \sim \mathrm{Bern}(\sigma(w_1 X_i + \epsilon_{1,i}))$... We set $\mu_0 = 1$, $\mu_1 = 3$ and $\sigma_0 = \sigma_1 = 1$ for the IHDP dataset... $Y_{t,i}(0) \sim N(\beta_0 X_i, 1) + C \sum_{j=0}^{t-1} Y_{j,i}(0)$... where $C = 0.02$ is a scaling factor... we set the initial value at time step 0 as $Y_{0,i}(0) = S_i(0)$, $Y_{0,i}(1) = S_i(1)$, then generate $Y_{t,i}(0)$, $Y_{t,i}(1)$ following Eq. (7), and we eventually regard the outcome at the last time step $T$ as the long-term reward, $Y_i(0) = Y_{T,i}(0)$, $Y_i(1) = Y_{T,i}(1)$. ... we fix the missing ratio of outcomes $Y$ to be 0.1 and the number of time steps is $T = 10$. For ease of comparison, we transform the optimization problem into $\arg\max_{\pi \in \Pi} (1-\lambda)\hat{V}(\pi; s) + \lambda \hat{V}(\pi; y)$, where $\lambda$ is a balance factor between short- and long-term rewards. ... we compare three different optimization strategies: NAIVE-S ($\lambda = 0$), NAIVE-Y ($\lambda = 1$), and Ours (Balanced, $\lambda = 0.5$).
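As a rough illustration of the simulation described above, the sketch below generates the short-term potential outcomes $S_i(a)$, rolls the autoregressive long-term outcomes forward to step $T$, and combines estimated short- and long-term policy values with the balance factor $\lambda$. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (simulate_outcomes, balanced_objective) and the random draws of the coefficient vectors w0, w1, beta0, beta1 are illustrative, and the treatment-assignment and 10% outcome-missingness mechanisms are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate_outcomes(X, T=10, C=0.02, seed=0):
    """Simulate short- and long-term potential outcomes per the setup above.

    X : (n, d) covariate matrix (e.g., IHDP covariates).
    Returns S(0), S(1) and the long-term rewards Y(0) = Y_T(0), Y(1) = Y_T(1).
    The coefficient vectors w0, w1, beta0, beta1 are drawn at random here
    purely for illustration; the paper only reports mu_0 = 1, mu_1 = 3 and
    sigma_0 = sigma_1 = 1 for the noise terms on IHDP.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w0, w1 = rng.normal(size=d), rng.normal(size=d)        # assumed coefficients
    beta0, beta1 = rng.normal(size=d), rng.normal(size=d)  # assumed coefficients

    # Short-term outcomes: S_i(a) ~ Bern(sigmoid(w_a X_i + eps_{a,i}))
    eps0 = rng.normal(1.0, 1.0, size=n)   # eps_{0,i} ~ N(mu_0 = 1, sigma_0 = 1)
    eps1 = rng.normal(3.0, 1.0, size=n)   # eps_{1,i} ~ N(mu_1 = 3, sigma_1 = 1)
    S0 = rng.binomial(1, sigmoid(X @ w0 + eps0))
    S1 = rng.binomial(1, sigmoid(X @ w1 + eps1))

    # Long-term outcomes: Y_{t,i}(a) ~ N(beta_a X_i, 1) + C * sum_{j<t} Y_{j,i}(a),
    # initialized at Y_{0,i}(a) = S_i(a); the long-term reward is the value at step T.
    Y0, Y1 = S0.astype(float), S1.astype(float)
    hist0, hist1 = Y0.copy(), Y1.copy()   # running sums of past outcomes
    for _ in range(1, T + 1):
        Y0 = rng.normal(X @ beta0, 1.0) + C * hist0
        Y1 = rng.normal(X @ beta1, 1.0) + C * hist1
        hist0 += Y0
        hist1 += Y1
    return S0, S1, Y0, Y1

def balanced_objective(v_short, v_long, lam=0.5):
    """(1 - lam) * V_hat(pi; s) + lam * V_hat(pi; y): lam = 0 recovers NAIVE-S,
    lam = 1 recovers NAIVE-Y, and lam = 0.5 is the balanced setting compared above."""
    return (1.0 - lam) * v_short + lam * v_long
```

A policy learner would then select the policy maximizing balanced_objective applied to its estimated short- and long-term values; how those values are estimated under missing outcomes is the subject of the paper itself and is not reproduced here.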