Revisiting Bellman Errors for Offline Model Selection

Authors: Joshua P Zitovsky, Daniel De Marchi, Rishabh Agarwal, Michael Rene Kosorok

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our estimator obtains impressive OMS performance on diverse discrete control tasks, including Atari games. (Section 5, Empirical Results)
Researcher Affiliation | Collaboration | 1 Department of Biostatistics, UNC Chapel Hill, North Carolina, USA; 2 Google DeepMind; 3 Mila.
Pseudocode | Yes | Our algorithm is summarized in Algorithm 1. Algorithm A.1: SBV with Tuned Regression Algorithm. Algorithm A.2: Applying Early Stopping to DQN with SBV. (See the illustrative early-stopping sketch below the table.)
Open Source Code | Yes | Finally, we open-source our code at https://github.com/jzitovsky/SBV.
Open Datasets | Yes | SBV achieves strong performance on diverse tasks ranging from healthcare problems (Klasnja et al., 2015) to Atari games (Bellemare et al., 2013). For the Bicycle control problem, we generated 10 offline datasets...following Ernst et al. (2005). For the mHealth control problem, we generated 10 offline datasets...following Luckett et al. (2020). Finally, we evaluated SBV (Algorithm 1) on 12 offline DQN Replay datasets (Agarwal et al., 2020)...
Dataset Splits | Yes | Randomly partition trajectories in D to training set D_T and validation set D_V (Algorithm 1). While P^µ is unknown, we can still estimate the expectation in Equation 4 by randomly partitioning 80% of the trajectories present in D into a training set D_T and reserving the remaining 20% of trajectories as a validation set D_V. (See the trajectory-split sketch below the table.)
Hardware Specification | Yes | Atari experiments were conducted using a mix of A100 and V100 GPUs from both our university’s computing cluster and GCP virtual machines. With four A100s and four V100s (or with six A100s)... Non-Atari experiments were conducted using 2.50 GHz Intel CPU cores from our university’s computing cluster.
Software Dependencies | Yes | Unless otherwise specified, all layers use the default parameters specified by TensorFlow v2.5.0 (Abadi et al., 2015)... The scripts we wrote to run DQN with these configurations made heavy use of the Dopamine library (Castro et al., 2018).
Experiment Setup | Yes | We tweaked the learning rate and target update frequency...to 2.5e-5 and 32,000, respectively... (D.1). Optimizer: Adam(learning_rate=2.5e-5, loss=Huber, batch_size=128, target_update_freq=32,000) (Figure D.1). Optimizer: Nadam(learning_rate=5e-4, loss=MSE, batch_size=512, max_epochs=40, mixed_precision=True) (Figure D.2). (See the configuration sketch below the table.)
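
The Pseudocode row references Algorithm A.2, which applies early stopping to DQN using SBV as the validation criterion. The snippet below is a minimal sketch of that early-stopping pattern only, not the paper's Algorithm A.2; `sbv_score`, `agent.train_one_iteration`, `agent.q_network`, and `agent.snapshot` are hypothetical placeholders rather than the API of the released SBV code.

```python
def train_with_early_stopping(agent, sbv_score, d_train, d_val, n_iterations):
    """Illustrative early-stopping loop in the spirit of Algorithm A.2.

    `agent` and `sbv_score` are hypothetical stand-ins: the agent exposes a
    per-iteration training step and a checkpointing method, and `sbv_score`
    returns a validation-based error estimate (assumed lower is better).
    """
    best_score, best_checkpoint = float("inf"), None
    for _ in range(n_iterations):
        agent.train_one_iteration(d_train)                   # hypothetical API
        score = sbv_score(agent.q_network, d_train, d_val)   # hypothetical API
        if score < best_score:
            best_score, best_checkpoint = score, agent.snapshot()
    return best_checkpoint, best_score
```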
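
The Dataset Splits row quotes an 80/20 random partition of whole trajectories into a training set D_T and a validation set D_V. Below is a minimal sketch of such a trajectory-level split; the container format and the helper name are assumptions, not the authors' code.

```python
import numpy as np

def split_trajectories(trajectories, val_frac=0.2, seed=0):
    """Randomly partition whole trajectories into D_T (train) and D_V (validation).

    `trajectories` is assumed to be a sequence of per-episode transition
    collections; splitting at the trajectory level ensures no episode is
    shared between the two sets.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(trajectories))
    n_val = int(round(val_frac * len(trajectories)))
    d_val = [trajectories[i] for i in idx[:n_val]]
    d_train = [trajectories[i] for i in idx[n_val:]]
    return d_train, d_val
```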
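
The Experiment Setup row lists two optimizer configurations (Figures D.1 and D.2). The sketch below restates them as Keras objects under TensorFlow 2.x, the framework named in the Software Dependencies row; the variable names and the mixed-precision policy call are illustrative assumptions rather than the authors' scripts.

```python
import tensorflow as tf

# DQN training configuration quoted from Figure D.1 (assumed Keras equivalents).
dqn_optimizer = tf.keras.optimizers.Adam(learning_rate=2.5e-5)
dqn_loss = tf.keras.losses.Huber()
dqn_batch_size = 128
dqn_target_update_freq = 32_000   # gradient steps between target-network updates

# Regression-model configuration quoted from Figure D.2 (assumed Keras equivalents).
tf.keras.mixed_precision.set_global_policy("mixed_float16")  # mixed_precision=True
reg_optimizer = tf.keras.optimizers.Nadam(learning_rate=5e-4)
reg_loss = tf.keras.losses.MeanSquaredError()
reg_batch_size = 512
reg_max_epochs = 40
```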