The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning

Authors: Kaiwen Wang, Kevin Zhou, Runzhe Wu, Nathan Kallus, Wen Sun

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now compare DISTCB with SquareCB [Foster and Rakhlin, 2020] and the state-of-the-art CB method FastCB [Foster and Krishnamurthy, 2021]... We consider three challenging tasks that are all derived from real-world datasets... Table 1: Avg cost over all episodes and last 100 episodes (lower is better). We report mean (sem) over 10 seeds. Reproducible code is available at https://github.com/kevinzhou497/distcb.
Researcher Affiliation | Academia | Kaiwen Wang, Kevin Zhou, Runzhe Wu, Nathan Kallus, Wen Sun; Cornell University; {kw437,klz23,rw646,kallus,ws455}@cornell.edu
Pseudocode | Yes | Algorithm 1: Distributional CB (DISTCB); Algorithm 2: Optimistic Distributional Confidence Set Optimization (O-DISCO); Algorithm 3: Pessimistic Distributional Confidence Set Optimization (P-DISCO).
Open Source Code | Yes | Reproducible code is available at https://github.com/kevinzhou497/distcb.
Open Datasets | Yes | King County Housing... Prudential Life Insurance... CIFAR-100: this popular image dataset contains 100 classes... [Krizhevsky, 2009]... [Montoya et al., 2015]... [Vanschoren et al., 2013]... Table 3: Overview of the three datasets and their experimental setups.
Dataset Splits | No | The paper does not specify dataset splits (e.g., percentages or counts for training, validation, or test sets); it only describes the number of episodes and batch sizes used for online learning.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments (e.g., specific GPU or CPU models, memory specifications, or cloud instances).
Software Dependencies | No | The paper mentions PyTorch and the WandB (Weights and Biases) library but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Our γ learning rate at each time step t is set to γ_t = γ_0 · t^p, where γ_0 and p are hyperparameters. We use batch sizes of 32 samples per episode. The King County and Prudential experiments run for 5,000 episodes, while the CIFAR-100 experiment runs for 15,000. For our regression oracles, we use ResNet18... and a simple 2-hidden-layer neural network...
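To make the Experiment Setup row above concrete, here is a minimal Python sketch of the reported configuration: the γ_t = γ_0 · t^p schedule, the batch size of 32, the per-dataset episode counts, and the two regression-oracle backbones. All names (ExperimentConfig, make_oracle), the hidden-layer widths, the distributional head width, and the pairing of ResNet18 with CIFAR-100 versus the MLP with the tabular tasks are illustrative assumptions, not taken from the authors' repository; consult https://github.com/kevinzhou497/distcb for the actual implementation.

```python
# Hedged sketch of the experiment configuration quoted above.
# Hyperparameter values (gamma_0, p) and oracle architectures beyond
# "ResNet18" and "a simple 2-hidden-layer network" are placeholders.
from dataclasses import dataclass

import torch.nn as nn
from torchvision.models import resnet18


@dataclass
class ExperimentConfig:
    dataset: str            # "king_county", "prudential", or "cifar100"
    num_episodes: int       # 5,000 for the tabular tasks, 15,000 for CIFAR-100
    batch_size: int = 32    # samples per episode, as reported
    gamma_0: float = 1.0    # hyperparameter (value not reported in this row)
    p: float = 0.5          # hyperparameter (value not reported in this row)

    def gamma(self, t: int) -> float:
        """Learning-rate schedule gamma_t = gamma_0 * t**p at episode t >= 1."""
        return self.gamma_0 * (t ** self.p)


def make_oracle(dataset: str, input_dim: int = 128, num_bins: int = 50) -> nn.Module:
    """Regression-oracle backbone. The dataset-to-oracle pairing, input_dim,
    hidden widths, and num_bins (distributional output head) are assumptions."""
    if dataset == "cifar100":
        model = resnet18(weights=None)
        model.fc = nn.Linear(model.fc.in_features, num_bins)
        return model
    # Tabular tasks: simple 2-hidden-layer feed-forward network.
    return nn.Sequential(
        nn.Linear(input_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, num_bins),
    )


configs = [
    ExperimentConfig("king_county", num_episodes=5_000),
    ExperimentConfig("prudential", num_episodes=5_000),
    ExperimentConfig("cifar100", num_episodes=15_000),
]
```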