The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning

Authors: Kaiwen Wang, Kevin Zhou, Runzhe Wu, Nathan Kallus, Wen Sun

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now compare DISTCB with SquareCB [Foster and Rakhlin, 2020] and the state-of-the-art CB method FastCB [Foster and Krishnamurthy, 2021]... We consider three challenging tasks that are all derived from real-world datasets... Table 1: Avg cost over all episodes and last 100 episodes (lower is better). We report mean (sem) over 10 seeds. Reproducible code is available at https://github.com/kevinzhou497/distcb.
Researcher Affiliation | Academia | Kaiwen Wang, Kevin Zhou, Runzhe Wu, Nathan Kallus, Wen Sun; Cornell University; {kw437,klz23,rw646,kallus,ws455}@cornell.edu
Pseudocode | Yes | Algorithm 1: Distributional CB (DISTCB); Algorithm 2: Optimistic Distributional Confidence Set Optimization (O-DISCO); Algorithm 3: Pessimistic Distributional Confidence Set Optimization (P-DISCO).
Open Source Code | Yes | Reproducible code is available at https://github.com/kevinzhou497/distcb.
Open Datasets | Yes | King County Housing... Prudential Life Insurance... CIFAR-100: this popular image dataset contains 100 classes... [Krizhevsky, 2009]... [Montoya et al., 2015]... [Vanschoren et al., 2013]... Table 3: Overview of the three datasets and their experimental setups.
Dataset Splits | No | The paper does not specify dataset splits (e.g., percentages or counts for training, validation, or test sets); it only describes the number of episodes and batch sizes used for online learning.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments (e.g., specific GPU or CPU models, memory specifications, or cloud instances).
Software Dependencies | No | The paper mentions PyTorch and the WandB (Weights and Biases) library but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Our γ learning rate at each time step t is set to γ_t = γ_0 · t^p, where γ_0 and p are hyperparameters. We use batch sizes of 32 samples per episode. The King County and Prudential experiments run for 5,000 episodes, while the CIFAR-100 experiment runs for 15,000. For our regression oracles, we use ResNet18... and a simple 2-hidden-layer neural network...
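To make the Experiment Setup row above concrete, here is a minimal Python sketch of the reported configuration: the γ_t = γ_0 · t^p schedule, the batch size of 32, the per-dataset episode counts, and the two regression-oracle backbones. All names (ExperimentConfig, make_oracle), the hidden-layer widths, the distributional head width, and the pairing of ResNet18 with CIFAR-100 versus the MLP with the tabular tasks are illustrative assumptions, not taken from the authors' repository; consult https://github.com/kevinzhou497/distcb for the actual implementation.

```python
# Hedged sketch of the experiment configuration quoted above.
# Hyperparameter values (gamma_0, p) and oracle architectures beyond
# "ResNet18" and "a simple 2-hidden-layer network" are placeholders.
from dataclasses import dataclass

import torch.nn as nn
from torchvision.models import resnet18


@dataclass
class ExperimentConfig:
    dataset: str            # "king_county", "prudential", or "cifar100"
    num_episodes: int       # 5,000 for the tabular tasks, 15,000 for CIFAR-100
    batch_size: int = 32    # samples per episode, as reported
    gamma_0: float = 1.0    # hyperparameter (value not reported in this row)
    p: float = 0.5          # hyperparameter (value not reported in this row)

    def gamma(self, t: int) -> float:
        """Learning-rate schedule gamma_t = gamma_0 * t**p at episode t >= 1."""
        return self.gamma_0 * (t ** self.p)


def make_oracle(dataset: str, input_dim: int = 128, num_bins: int = 50) -> nn.Module:
    """Regression-oracle backbone. The dataset-to-oracle pairing, input_dim,
    hidden widths, and num_bins (distributional output head) are assumptions."""
    if dataset == "cifar100":
        model = resnet18(weights=None)
        model.fc = nn.Linear(model.fc.in_features, num_bins)
        return model
    # Tabular tasks: simple 2-hidden-layer feed-forward network.
    return nn.Sequential(
        nn.Linear(input_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, num_bins),
    )


configs = [
    ExperimentConfig("king_county", num_episodes=5_000),
    ExperimentConfig("prudential", num_episodes=5_000),
    ExperimentConfig("cifar100", num_episodes=15_000),
]
```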