Solving the Rubik's Cube with Approximate Policy Iteration
Authors: Stephen McAleer, Forest Agostinelli, Alexander Shmakov, Pierre Baldi
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our algorithm is able to solve 100% of randomly scrambled cubes while achieving a median solve length of 30 moves, less than or equal to solvers that employ human domain knowledge. Our algorithm, called Autodidactic Iteration (ADI), trains a neural network value and policy function through an iterative process. These neural networks are the "fast policy" of DPI described earlier. After the network is trained, it is combined with MCTS to effectively solve the Rubik's Cube. We call the resulting solver DeepCube. (See the ADI sketch below the table.) |
| Researcher Affiliation | Academia | Stephen McAleer, Department of Statistics, University of California, Irvine, smcaleer@uci.edu; Forest Agostinelli, Department of Computer Science, University of California, Irvine, fagostin@uci.edu; Alexander Shmakov, Department of Computer Science, University of California, Irvine, ashmakov@uci.edu; Pierre Baldi, Department of Computer Science, University of California, Irvine, pfbaldi@ics.uci.edu |
| Pseudocode | Yes | Algorithm 1: Autodidactic Iteration |
| Open Source Code | No | The paper does not include an unambiguous statement or a direct link to a source-code repository for the methodology described in this paper. |
| Open Datasets | No | The paper generates its own training data by starting from the solved state and scrambling the cube, rather than using a pre-existing, publicly accessible dataset with concrete access information. |
| Dataset Splits | No | The paper mentions 'training samples' and evaluating on 'randomly scrambled cubes' but does not specify exact dataset splits (percentages or counts) for training, validation, or testing. |
| Hardware Specification | Yes | Our training machine was a 32-core Intel Xeon E5-2620 server with three NVIDIA Titan XP GPUs. |
| Software Dependencies | No | The paper mentions the use of the RMSProp optimizer and a feedforward network, but it does not specify versions for any key software libraries, frameworks, or dependencies (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | No | The paper gives general training details, such as the RMSProp optimizer, a mean squared error value loss, a softmax cross-entropy policy loss, and 2,000,000 training iterations, and it names the exploration (c) and virtual loss (ν) hyperparameters, but it does not provide specific numerical values for these hyperparameters or other system-level training settings. (Hedged sketches of the training step and the search rule appear below the table.) |
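
The Research Type and Pseudocode rows describe Autodidactic Iteration (ADI) only in prose. Below is a minimal Python sketch of the target-generation step from Algorithm 1, assuming hypothetical `apply_move`, `is_solved`, and `value_fn` callables that stand in for the cube environment and the current network; it is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

NUM_MOVES = 12  # quarter-turn metric: 6 faces x 2 turn directions

def adi_targets(solved_state, apply_move, is_solved, value_fn, depth, rng=np.random):
    """Scramble from the solved state and build value/policy targets (Algorithm 1 sketch)."""
    states, weights = [], []
    state = solved_state
    for d in range(1, depth + 1):
        state = apply_move(state, rng.randint(NUM_MOVES))
        states.append(state)
        weights.append(1.0 / d)  # weight samples inversely to scramble distance

    value_targets, policy_targets = [], []
    for s in states:
        child_values = np.empty(NUM_MOVES)
        for a in range(NUM_MOVES):
            child = apply_move(s, a)
            reward = 1.0 if is_solved(child) else -1.0
            child_values[a] = reward + value_fn(child)  # bootstrap from the current network
        value_targets.append(float(child_values.max()))   # value target: best child value
        policy_targets.append(int(child_values.argmax())) # policy target: best move index
    return states, value_targets, policy_targets, weights
```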
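The Experiment Setup row notes that the optimizer and losses are named but their hyperparameters are not. The sketch below shows how the stated pieces (a feedforward value/policy network, RMSProp, an MSE value loss, a softmax cross-entropy policy loss, and per-sample weights) could fit together in PyTorch; the layer sizes and learning rate are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

STATE_DIM = 20 * 24   # one-hot sticker encoding of the cube state
NUM_MOVES = 12

class CubeNet(nn.Module):
    """Stand-in for the feedforward body with value and policy heads; hidden sizes are illustrative."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(STATE_DIM, 1024), nn.ELU(),
            nn.Linear(1024, 512), nn.ELU(),
        )
        self.value_head = nn.Linear(512, 1)
        self.policy_head = nn.Linear(512, NUM_MOVES)

    def forward(self, x):
        h = self.body(x)
        return self.value_head(h).squeeze(-1), self.policy_head(h)

net = CubeNet()
# The paper names RMSProp but not its hyperparameters; this learning rate is an assumed placeholder.
optimizer = torch.optim.RMSprop(net.parameters(), lr=1e-4)
value_loss_fn = nn.MSELoss(reduction="none")
policy_loss_fn = nn.CrossEntropyLoss(reduction="none")

def training_step(states, value_targets, policy_targets, weights):
    """One weighted update combining the MSE value loss and the cross-entropy policy loss."""
    optimizer.zero_grad()
    value_pred, policy_logits = net(states)
    loss = (weights * (value_loss_fn(value_pred, value_targets)
                       + policy_loss_fn(policy_logits, policy_targets))).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `states`, `value_targets`, `policy_targets`, and `weights` would be tensors built from ADI targets like those produced in the sketch above.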
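The paper also names the exploration constant c and the virtual loss ν used by the MCTS solver without reporting their values. The following is a small sketch of a PUCT-style selection rule with virtual loss, consistent with the description of DeepCube's search; the constants and array layout are placeholders, not the authors' settings.

```python
import numpy as np

# Placeholder values for the exploration constant c and virtual loss nu,
# which the paper names but does not report numerically.
C_EXPLORE = 1.0
NU = 1.0

def select_move(N, W, L, prior):
    """Tree-policy step: exploration bonus U plus value estimate Q = W - L.

    N, W, L, prior are NumPy arrays of length 12 holding, per move, the visit
    count, accumulated maximal value, virtual loss, and network policy prior.
    """
    U = C_EXPLORE * prior * np.sqrt(N.sum()) / (1.0 + N)
    a = int(np.argmax(U + W - L))
    L[a] += NU  # virtual loss discourages parallel workers from repeating this path
    return a
```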