Solving the Rubik's Cube with Approximate Policy Iteration

Authors: Stephen McAleer, Forest Agostinelli, Alexander Shmakov, Pierre Baldi

Venue: ICLR 2019

Reproducibility assessment. Each entry below gives the variable, the assessed result, and the LLM response supporting it.
Research Type: Experimental
LLM Response: "Our algorithm is able to solve 100% of randomly scrambled cubes while achieving a median solve length of 30 moves, less than or equal to solvers that employ human domain knowledge. Our algorithm, called Autodidactic Iteration (ADI), trains a neural network value and policy function through an iterative process. These neural networks are the 'fast policy' of DPI described earlier. After the network is trained, it is combined with MCTS to effectively solve the Rubik's Cube. We call the resulting solver DeepCube."
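
The combination of the trained network with MCTS is only summarized above. As a rough illustration, here is a minimal Python sketch of the selection step of a DeepCube-style search. The node layout, the placeholder network `net`, and the default values for the exploration constant `c` and virtual loss `nu` are assumptions, since the paper publishes neither an implementation nor these hyperparameter values.

```python
import math
import random

NUM_ACTIONS = 12  # the 12 face turns of the Rubik's Cube

def net(state):
    """Placeholder value/policy network: returns (value, action priors)."""
    return random.random(), [1.0 / NUM_ACTIONS] * NUM_ACTIONS

class Node:
    def __init__(self, priors):
        self.P = priors                # priors from the policy head
        self.N = [0] * NUM_ACTIONS     # visit counts per action
        self.W = [0.0] * NUM_ACTIONS   # best value seen under each action
        self.L = [0.0] * NUM_ACTIONS   # accumulated virtual loss

def select_action(node, c=1.0, nu=0.1):
    """Pick the action maximizing U(s,a) + Q(s,a); the virtual loss nu
    (subtracted inside Q) discourages revisiting the same path. Only the
    downward selection step is shown; backup would restore L and update W."""
    total_n = sum(node.N)
    best, best_score = 0, -math.inf
    for a in range(NUM_ACTIONS):
        u = c * node.P[a] * math.sqrt(total_n) / (1 + node.N[a])
        q = node.W[a] - node.L[a]
        if u + q > best_score:
            best, best_score = a, u + q
    node.L[best] += nu  # virtual loss is added on the way down
    return best
```
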
Researcher Affiliation: Academia
LLM Response: "Stephen McAleer, Department of Statistics, University of California, Irvine (smcaleer@uci.edu); Forest Agostinelli, Department of Computer Science, University of California, Irvine (fagostin@uci.edu); Alexander Shmakov, Department of Computer Science, University of California, Irvine (ashmakov@uci.edu); Pierre Baldi, Department of Computer Science, University of California, Irvine (pfbaldi@ics.uci.edu)."
Pseudocode: Yes
LLM Response: "Algorithm 1: Autodidactic Iteration"
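
The paper's Algorithm 1 is pseudocode only. Below is a minimal sketch of its target-generation step under stated assumptions: `apply_move`, `is_solved`, and `net` are hypothetical stand-ins for the cube dynamics and the value/policy network, and the reward shape (+1 at the solved state, -1 otherwise) follows the paper's description.

```python
import random

NUM_ACTIONS = 12  # the 12 face turns

def apply_move(state, action):
    # Hypothetical stand-in: a real implementation applies the face turn.
    return state

def is_solved(state):
    # Hypothetical stand-in for the goal test.
    return state == "solved"

def net(state):
    # Placeholder value/policy network: (scalar value, action priors).
    return random.random(), [1.0 / NUM_ACTIONS] * NUM_ACTIONS

def reward(state):
    return 1.0 if is_solved(state) else -1.0

def adi_targets(states):
    """Target generation as in Algorithm 1 (Autodidactic Iteration):
    the value target is the best one-step lookahead bootstrap, and the
    policy target is the action that achieves it."""
    value_targets, policy_targets = [], []
    for s in states:
        backups = []
        for a in range(NUM_ACTIONS):
            child = apply_move(s, a)
            v_child, _ = net(child)
            backups.append(reward(child) + v_child)
        best = max(range(NUM_ACTIONS), key=lambda a: backups[a])
        value_targets.append(backups[best])
        policy_targets.append(best)
    return value_targets, policy_targets
```
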
Open Source Code: No
LLM Response: "The paper does not include an unambiguous statement or a direct link to a source-code repository for the methodology described in this paper."
Open Datasets: No
LLM Response: "The paper generates its own training data by starting from the solved state and scrambling the cube, rather than using a pre-existing, publicly accessible dataset with concrete access information."
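
Since the training data is generated rather than downloaded, a minimal sketch of that generation process may clarify the point. The function below is illustrative only: `apply_move` is again a hypothetical stand-in for the cube dynamics, and recording each state's scramble depth reflects the paper's depth-weighted training samples.

```python
import random

MOVES = ["U", "U'", "D", "D'", "L", "L'", "R", "R'", "F", "F'", "B", "B'"]

def scramble_trajectory(solved_state, k, apply_move):
    """Random-walk k moves away from the solved state, yielding one
    training sample per intermediate state together with its scramble
    depth (the paper weights each sample's loss by its depth)."""
    samples, state = [], solved_state
    for depth in range(1, k + 1):
        state = apply_move(state, random.choice(MOVES))
        samples.append((state, depth))
    return samples

# Illustrative usage with toy string states:
samples = scramble_trajectory("solved", 5, lambda s, m: s + " " + m)
```
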
Dataset Splits: No
LLM Response: "The paper mentions 'training samples' and evaluating on 'randomly scrambled cubes' but does not specify exact dataset splits (percentages or counts) for training, validation, or testing."
Hardware Specification: Yes
LLM Response: "Our training machine was a 32-core Intel Xeon E5-2620 server with three NVIDIA Titan XP GPUs."
Software Dependencies: No
LLM Response: "The paper mentions the use of the RMSProp optimizer and a feedforward network, but it does not specify versions for any key software libraries, frameworks, or dependencies (e.g., PyTorch, TensorFlow, Python version)."
Experiment Setup: No
LLM Response: "The paper mentions general training details, such as the RMSProp optimizer, mean squared error loss, softmax cross-entropy loss, and the number of training iterations (2,000,000), and it names the exploration (c) and virtual loss (ν) hyperparameters, but it does not provide specific numerical values for these hyperparameters or other system-level training settings."
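
To make the reported training details concrete, here is a minimal PyTorch sketch of the training step the paper describes: RMSProp, mean squared error on the value head, and softmax cross-entropy on the policy head. The framework choice, layer sizes, input encoding, learning rate, and equal loss weighting are all assumptions, since the paper specifies none of these values.

```python
import torch
import torch.nn as nn

class ValuePolicyNet(nn.Module):
    """Feedforward network with shared body and separate value/policy
    heads; the sizes below are illustrative, not the paper's."""
    def __init__(self, state_dim=480, num_actions=12):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 4096), nn.ELU(),
                                  nn.Linear(4096, 2048), nn.ELU())
        self.value_head = nn.Linear(2048, 1)
        self.policy_head = nn.Linear(2048, num_actions)

    def forward(self, x):
        h = self.body(x)
        return self.value_head(h).squeeze(-1), self.policy_head(h)

net = ValuePolicyNet()
opt = torch.optim.RMSprop(net.parameters(), lr=1e-4)  # lr is an assumption
mse, xent = nn.MSELoss(), nn.CrossEntropyLoss()

def train_step(states, value_targets, policy_targets):
    """One update: MSE on the value head plus softmax cross-entropy on
    the policy head, equally weighted (weighting is an assumption)."""
    v_pred, p_logits = net(states)
    loss = mse(v_pred, value_targets) + xent(p_logits, policy_targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Illustrative usage with random tensors standing in for encoded cubes:
states = torch.rand(32, 480)
value_targets = torch.rand(32)
policy_targets = torch.randint(0, 12, (32,))
print(train_step(states, value_targets, policy_targets))
```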