Learning Simultaneous Navigation and Construction in Grid Worlds
Authors: Wenyu Han, Haoran Wu, Eisuke Hirota, Alexander Gao, Lerrel Pinto, Ludovic Righetti, Chen Feng
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments show that pre-training this position estimation module before Q-learning can significantly improve the construction performance measured by the intersection-over-union score, achieving the best results in our benchmark of various baselines including model-free and model-based RL, a handcrafted SLAM-based policy, and human players. Our code is available at: https://ai4ce.github.io/SNAC/. |
| Researcher Affiliation | Academia | Wenyu Han, Haoran Wu, Eisuke Hirota, Alexander Gao, Lerrel Pinto, Ludovic Righetti, Chen Feng (New York University) |
| Pseudocode | Yes | Algorithm 1 Handcrafted Policy |
| Open Source Code | Yes | Our code is available at: https://ai4ce.github.io/SNAC/. |
| Open Datasets | No | The paper states that the authors created their own dataset rather than using a publicly available one. |
| Dataset Splits | Yes | For the variable design tasks, we randomly generated 500 ground-truth designs and split them to 8/1/1 for training/validation/testing. |
| Hardware Specification | Yes | We test each simulation environment for 500 episodes of games on Intel(R) Core(TM) i9-9920X CPU @ 3.50GHz using a single thread |
| Software Dependencies | No | The paper mentions software components such as "Stable Baselines" and various RL algorithms (DQN, DRQN, PPO, Rainbow, SAC), but it does not give version numbers for these dependencies or for the programming languages/frameworks used. |
| Experiment Setup | Yes | To validate the proposed framework and its robustness, all baselines are trained with the same set of 4 random seeds and averaged results are reported. ... For the constant design tasks in 1D/2D/3D, we test the trained agent for 500 times for each task... For the variable design tasks, we randomly generated 500 ground-truth designs and split them to 8/1/1 for training/validation/testing. ... DQN. ... We train DQN on each task for 3,000 episodes. Batch size is 2,000, and replay buffer size 50,000. ... DRQN. We train it for 10,000 episodes with batch size of 64 and replay memory size of 1,000. ... PPO. ... We train PPO for 10 million time steps... we chose the following values: 1e5 for the batch size, 1e2 for the number of minibatches, 2.5e-4 for the learning rate and 0.1 for the clipping threshold. |
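
The Dataset Splits row above reports that 500 randomly generated ground-truth designs are split 8/1/1 for training/validation/testing. The paper does not publish the splitting code, so the snippet below is only a minimal sketch of such a split, assuming a seeded random shuffle of design indices; the function name and the `seed` value are illustrative, not taken from the paper.

```python
import random

def split_designs(num_designs=500, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split design indices into train/val/test by the given ratios.

    Hypothetical helper: the paper only states a 500-design 8/1/1 split,
    not how (or with which seed) the shuffle was performed.
    """
    indices = list(range(num_designs))
    random.Random(seed).shuffle(indices)
    n_train = int(ratios[0] * num_designs)   # 400 designs
    n_val = int(ratios[1] * num_designs)     # 50 designs
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]         # remaining 50 designs
    return train, val, test

train_ids, val_ids, test_ids = split_designs()
print(len(train_ids), len(val_ids), len(test_ids))  # 400 50 50
```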
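
The Experiment Setup row lists the reported PPO settings (10 million time steps, a batch size of 1e5, 1e2 minibatches, learning rate 2.5e-4, clipping threshold 0.1), and the Software Dependencies row notes that Stable Baselines is used without a version number. Below is a minimal sketch of how those numbers might map onto Stable-Baselines3's PPO; the SB3/Gymnasium versions, the reading of "1e2 minibatches" as a minibatch size of 1,000, and the placeholder `CartPole-v1` environment are all assumptions, since the paper's actual SNAC environments are not reproduced here.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Placeholder environment: the paper's SNAC grid-world tasks are not
# packaged here, so a standard Gym task stands in for illustration.
env = gym.make("CartPole-v1")

model = PPO(
    "MlpPolicy",
    env,
    n_steps=100_000,       # reported batch size of 1e5 (assumed mapping)
    batch_size=1_000,      # 1e5 samples / 1e2 minibatches (assumed mapping)
    learning_rate=2.5e-4,  # reported learning rate
    clip_range=0.1,        # reported clipping threshold
    verbose=1,
)

# Reported training budget: 10 million time steps.
model.learn(total_timesteps=10_000_000)
```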