Learning to Make Adherence-aware Advice
Authors: Guanting Chen, Xiaocheng Li, Chunlin Sun, Hanzhao Wang
Venue: ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 NUMERICAL EXPERIMENT: We perform numerical experiments under two environments: Flappy Bird (Williams et al., 2023) and Car Driving (Meresht et al., 2020). Both Atari-game-like environments are suitable and convenient for modeling human behavior while retaining the learning structure for the machine. We focus on the Flappy Bird environment here and defer the Car Driving environment to Appendix B. Figure 2: The regrets for learning the optimal advice for Policy Greedy and Policy Safe. Figures 2a and 2b show the regrets of RFE-AD, UCB-AD, and EULER for the two policies, respectively. Figure 2c shows the regrets of UCB-AD for the two policies under different θs. Figure 3: The performance of making pertinent advice. The value gap is defined as the difference between the value of the current policy and the optimal value, with the red dashed line as the benchmark for zero loss of the policy. Figure 3a shows the convergence of RFE-β under different βs. Figure 3b compares the convergence of RFE-CMDP and UC-CFH. Figure 3c evaluates the performance of the policy learned from the learning episodes in Figure 3b. |
| Researcher Affiliation | Academia | Guanting Chen (1), Xiaocheng Li (2), Chunlin Sun (3), Hanzhao Wang (2). (1) Department of Statistics and Operations Research, UNC-Chapel Hill; (2) Imperial College Business School, Imperial College London; (3) Institute for Computational and Mathematical Engineering, Stanford University |
| Pseudocode | Yes | Algorithm 1: UCB-ADherence (UCB-AD) ... Algorithm 2: RFE-β |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | No | The paper uses 'Flappy Bird' and 'Car Driving (Meresht et al., 2020)' as environments for its numerical experiments. It describes the environment setup but does not provide specific access information (link, DOI, or a specific citation to a dataset resource) for any publicly available dataset used for training. |
| Dataset Splits | No | The paper does not provide specific details on dataset splits (e.g., percentages or sample counts for training, validation, or test sets). It describes the game environments and human policies but not data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper does not list specific software components with their version numbers. |
| Experiment Setup | Yes | We define the state space as the current locations of the bird on the grid, represented by coordinates (x, y) ∈ Z², with a total of 7 × 20 = 140 states. Regarding the action space, we define it as A = {Up, Up-Up, Down}. Each action causes the bird to move forward by one cell; in addition, the Up action moves the bird one cell upwards, the Up-Up action moves it two cells upwards, and the Down action moves it one cell downwards. The MDP has a reward that is a function of the state only: we receive a reward of 1 when the current state (location) contains a star, and 0 otherwise. To model human behavior, we consider two sub-optimal human policies: Policy Greedy, which prioritizes collecting stars in the next column, and Policy Safe, which focuses on avoiding walls in the next column. If there is no preferred action available, both policies maintain a horizontal zig-zag line by alternating between Up and Down. For the adherence level θ, we assume that for all s ∈ S and h = 1, ..., H, the human adheres to the advice with probability 0.9, except for the aggressive advice Up-Up (which moves too fast vertically), which has an adherence level of 0.7. (A minimal simulation sketch of this setup follows the table.) |
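
The Experiment Setup row above specifies the Flappy Bird MDP and the adherence model in enough detail that a small simulation sketch can make the reproduction target concrete. The sketch below is not the authors' code: the grid orientation, the star layout in `has_star`, the fallback human policy, and all function names are illustrative assumptions; only the 7 × 20 state space, the action set {Up, Up-Up, Down}, and the adherence probabilities (0.9 in general, 0.7 for Up-Up) are taken from the paper.

```python
import numpy as np

# Illustrative sketch of the Flappy Bird environment and adherence model described
# in the Experiment Setup row. Grid orientation, star layout, and the fallback human
# policy are assumptions; the state/action spaces and adherence levels follow the paper.

GRID_HEIGHT, GRID_WIDTH = 7, 20                 # 7 x 20 = 140 states, coordinates (x, y)
ACTIONS = {"Up": +1, "Up-Up": +2, "Down": -1}   # vertical shift; every action also advances one column
THETA = {"Up": 0.9, "Up-Up": 0.7, "Down": 0.9}  # adherence level per advised action


def has_star(x, y):
    """Hypothetical star layout; the paper does not specify the exact placement."""
    return x % 3 == 0 and y == GRID_HEIGHT // 2


def human_action(last_action, preferred=None):
    """Stand-in for Policy Greedy / Policy Safe: take the preferred action if one
    exists, otherwise keep a horizontal zig-zag by alternating Up and Down."""
    if preferred is not None:
        return preferred
    return "Down" if last_action == "Up" else "Up"


def step_with_advice(state, advice, rng, last_action="Up", preferred=None):
    """One transition: the human follows the advice with probability THETA[advice],
    otherwise falls back to their own policy."""
    x, y = state
    action = advice if rng.random() < THETA[advice] else human_action(last_action, preferred)
    y_next = int(np.clip(y + ACTIONS[action], 0, GRID_HEIGHT - 1))
    x_next = min(x + 1, GRID_WIDTH - 1)
    reward = 1.0 if has_star(x_next, y_next) else 0.0
    return (x_next, y_next), action, reward


# Example: advise the aggressive Up-Up move; it is ignored with probability 0.3.
rng = np.random.default_rng(0)
next_state, taken, r = step_with_advice((0, 3), "Up-Up", rng)
print(next_state, taken, r)
```

Under an interface like this, an adherence-aware learner in the spirit of UCB-AD or RFE-AD would repeatedly call `step_with_advice` for H steps per episode and compare the collected reward against that of the optimal advice policy to estimate the regret curves reported in Figure 2.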