Revisiting Bellman Errors for Offline Model Selection
Authors: Joshua P Zitovsky, Daniel De Marchi, Rishabh Agarwal, Michael Rene Kosorok
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our estimator obtains impressive OMS performance on diverse discrete control tasks, including Atari games. (Section 5: Empirical Results) |
| Researcher Affiliation | Collaboration | ¹Department of Biostatistics, UNC Chapel Hill, North Carolina, USA; ²Google DeepMind; ³Mila. |
| Pseudocode | Yes | Our algorithm is summarized in Algorithm 1. Algorithm A.1 SBV with Tuned Regression Algorithm. Algorithm A.2 Applying Early Stopping to DQN with SBV. |
| Open Source Code | Yes | Finally, we open-source our code at https://github.com/jzitovsky/SBV. |
| Open Datasets | Yes | SBV achieves strong performance on diverse tasks ranging from healthcare problems (Klasnja et al., 2015) to Atari games (Bellemare et al., 2013). For the Bicycle control problem, we generated 10 offline datasets...following Ernst et al. (2005). For the mHealth control problem, we generated 10 offline datasets...following Luckett et al. (2020). Finally, we evaluated SBV (Algorithm 1) on 12 offline DQN-Replay datasets (Agarwal et al., 2020)... |
| Dataset Splits | Yes | Randomly partition trajectories in D into training set D_T and validation set D_V (Algorithm 1). While P^µ is unknown, we can still estimate the expectation in Equation 4 by randomly partitioning 80% of the trajectories present in D into a training set D_T and reserving the remaining 20% of trajectories as a validation set D_V. |
| Hardware Specification | Yes | Atari experiments were conducted using a mix of A100 and V100 GPUs from both our university’s computing cluster and GCP virtual machines. With four A100s and four V100s (or with six A100s)... Non-Atari experiments were conducted using 2.50 GHz Intel CPU cores from our university’s computing cluster. |
| Software Dependencies | Yes | Unless otherwise specified, all layers use the default parameters specified by TensorFlow v2.5.0 (Abadi et al., 2015)... The scripts we wrote to run DQN with these configurations made heavy use of the Dopamine library (Castro et al., 2018). |
| Experiment Setup | Yes | We tweaked the learning rate and target update frequency... to 2.5e-5 and 32,000, respectively... (D.1). Optimizer: Adam(learning_rate=2.5e-5, loss=Huber, batch_size=128, target_update_freq=32,000) (Figure D.1). Optimizer: Nadam(learning_rate=5e-4, loss=MSE, batch_size=512, max_epochs=40, mixed_precision=True) (Figure D.2). |
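The Dataset Splits row quotes a trajectory-level 80/20 partition of the offline dataset D into D_T and D_V. A minimal sketch of that kind of split is below; the function name, the `trajectories` container, and the seed handling are hypothetical illustrations rather than the authors' implementation, which is released at https://github.com/jzitovsky/SBV.

```python
# Minimal sketch of a trajectory-level 80/20 train/validation split, in the
# spirit of Algorithm 1's "randomly partition trajectories in D". The helper
# name and data layout are hypothetical, not taken from the paper's code.
import numpy as np

def split_trajectories(trajectories, train_frac=0.8, seed=0):
    """Randomly partition whole trajectories into training and validation sets.

    Splitting at the trajectory level (rather than the transition level) keeps
    every transition of a given episode on the same side of the split.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(trajectories))
    n_train = int(train_frac * len(trajectories))
    train = [trajectories[i] for i in idx[:n_train]]
    val = [trajectories[i] for i in idx[n_train:]]
    return train, val
```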
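The Experiment Setup row quotes two optimizer configurations (Figures D.1 and D.2). The sketch below writes those hyperparameters against the TensorFlow/Keras API (the paper reports TensorFlow v2.5.0); only the numeric values come from the quotes, while the variable names and how they are wired into the DQN and SBV training loops are assumptions.

```python
# Hedged sketch of the two quoted optimizer configurations; hyperparameter
# values are from the paper, everything else is illustrative.
import tensorflow as tf

# DQN configuration (Figure D.1): Adam with lr=2.5e-5, Huber loss,
# batch size 128, target-network update every 32,000 steps.
dqn_optimizer = tf.keras.optimizers.Adam(learning_rate=2.5e-5)
dqn_loss = tf.keras.losses.Huber()
DQN_BATCH_SIZE = 128
TARGET_UPDATE_FREQ = 32_000

# SBV regression configuration (Figure D.2): Nadam with lr=5e-4, MSE loss,
# batch size 512, at most 40 epochs, mixed-precision training enabled.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
sbv_optimizer = tf.keras.optimizers.Nadam(learning_rate=5e-4)
sbv_loss = tf.keras.losses.MeanSquaredError()
SBV_BATCH_SIZE = 512
SBV_MAX_EPOCHS = 40
```

Note that the global mixed-precision policy affects any Keras model built after it is set, so in practice it would be configured before constructing the regression network.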