The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning
Authors: Kaiwen Wang, Kevin Zhou, Runzhe Wu, Nathan Kallus, Wen Sun
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now compare DISTCB with SquareCB [Foster and Rakhlin, 2020] and the state-of-the-art CB method FastCB [Foster and Krishnamurthy, 2021]... We consider three challenging tasks that are all derived from real-world datasets... Table 1: Avg cost over all episodes and last 100 episodes (lower is better). We report mean (sem) over 10 seeds. Reproducible code is available at https://github.com/kevinzhou497/distcb. |
| Researcher Affiliation | Academia | Kaiwen Wang, Kevin Zhou, Runzhe Wu, Nathan Kallus, Wen Sun (Cornell University) {kw437,klz23,rw646,kallus,ws455}@cornell.edu |
| Pseudocode | Yes | Algorithm 1 Distributional CB (DISTCB), Algorithm 2 Optimistic Distributional Confidence set Optimization (O-DISCO), Algorithm 3 Pessimistic Distributional Confidence set Optimization (P-DISCO). A hedged sketch of the DISTCB action-sampling step appears after this table. |
| Open Source Code | Yes | Reproducible code is available at https://github.com/kevinzhou497/distcb. |
| Open Datasets | Yes | King County Housing... Prudential Life Insurance... CIFAR-100: This popular image dataset contains 100 classes... [Krizhevsky, 2009]... [Montoya et al., 2015]... [Vanschoren et al., 2013]... Table 3: Overview of the three datasets and their experimental setups |
| Dataset Splits | No | The paper does not specify dataset splits (e.g., percentages or counts for training, validation, or test sets). It describes the number of episodes and batch sizes for online learning. |
| Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., specific GPU or CPU models, memory specifications, or cloud instances) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch' and the 'WandB (Weights and Biases) library' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | our γ learning rate at each time step t is set to γ_t = γ_0 · t^p, where γ_0 and p are hyperparameters. We use batch sizes of 32 samples per episode. The King County and Prudential experiments run for 5,000 episodes while the CIFAR-100 experiment runs for 15,000. For our regression oracles, we use ResNet18... and a simple 2-hidden-layer neural network... A hedged configuration sketch appears after this table. |
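
On the pseudocode row: the paper builds DISTCB on top of SquareCB, so a reasonable reading is that Algorithm 1 samples actions via SquareCB-style inverse gap weighting applied to the optimistic cost estimates produced by O-DISCO. The sketch below shows only that sampling step, under that assumption; `igw_action_probs` and `optimistic_costs` are our own names, and the O-DISCO inner optimization that would produce the optimistic estimates is not reproduced here.

```python
import numpy as np

def igw_action_probs(optimistic_costs, gamma_t):
    """Inverse-gap-weighted exploration (SquareCB-style sampling rule).

    Assumption: DISTCB applies this rule to per-action optimistic expected
    costs coming out of O-DISCO (not implemented in this sketch).

    optimistic_costs: array of per-action optimistic expected costs
    gamma_t: exploration learning rate at episode t
    """
    costs = np.asarray(optimistic_costs, dtype=float)
    best = int(np.argmin(costs))  # greedy action (lowest optimistic cost)
    K = len(costs)
    probs = np.empty(K)
    # Non-greedy actions get probability inversely proportional to their
    # cost gap relative to the greedy action, scaled by gamma_t.
    for a in range(K):
        if a != best:
            probs[a] = 1.0 / (K + gamma_t * (costs[a] - costs[best]))
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()  # remaining mass goes to the greedy action
    return probs
```

As gamma_t grows over episodes, mass concentrates on the greedy action, which is why the schedule for γ_t (next sketch) matters for the exploration-exploitation trade-off.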
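On the experiment-setup row: the quoted details translate into a small configuration sketch. The schedule below implements the reported γ_t = γ_0 · t^p learning rate; `gamma_schedule` and `CONFIG` are hypothetical names, and the hyperparameter values in the usage line are illustrative, since the excerpt does not quote the tuned values of γ_0 and p.

```python
def gamma_schedule(t, gamma0, p):
    """Exploration learning rate gamma_t = gamma_0 * t**p at episode t."""
    return gamma0 * t ** p

# Hypothetical reconstruction of the reported experiment configuration.
CONFIG = {
    "batch_size": 32,  # samples per episode (from the paper)
    "episodes": {
        "king_county": 5_000,
        "prudential": 5_000,
        "cifar100": 15_000,
    },
    # Regression oracles per the paper: ResNet18 for CIFAR-100,
    # a simple 2-hidden-layer neural network for the other tasks.
}

# Illustrative usage; gamma0 and p values here are NOT the paper's tuned ones.
gamma_t = gamma_schedule(t=100, gamma0=10.0, p=0.5)
```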