Goal-Directed Planning via Hindsight Experience Replay

Authors: Lorenzo Moro, Amarildo Likmeta, Enrico Prati, Marcello Restelli

ICLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate the effectiveness of the proposed approach through an extensive empirical evaluation in several simulated domains, including a novel application to a quantum compiling domain.
Researcher Affiliation Academia (1) DEIB, Politecnico di Milano, Milan, Italy; (2) CNR-IFN, Milan, Italy; (3) FABIT, Università di Bologna, Bologna, Italy
Pseudocode Yes Algorithm 1: Alpha Zero HER
    Initialize memory buffer $B$
    Initialize policy $\pi_\theta$ and value network $v_\theta$
    for epoch = 1, ..., N do
        for episode = 1, ..., M do
            experiences $\leftarrow \{\}$
            $s_t \sim \mu$  // Sample initial state
            while not done do
                $p_t, a_t \leftarrow \mathrm{MCTS}(s_t, \pi_\theta, v_\theta)$
                $s_{t+1}, r_t, \mathrm{done} \leftarrow \mathrm{applyAction}(a_t)$
                experiences $\leftarrow$ experiences $\cup\ (s_t, p_t, r_t)$
                $s_t \leftarrow s_{t+1}$
            end
            Store every experience $(s_t, p_t, z_t)$ in $B$, where $z_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$
            for t in episode experiences do
                // Generate new experiences
                $G \leftarrow$ Sample $k$ goals from future visited states $s_j$ where $j > t$
                for $g$ in $G$ do
                    $r_t^g \leftarrow r(s_t, a_t, g)$
                end
                Store every $(s_t, p_t, z_t^g)$ in $B$, where $z_t^g = \sum_{i=t}^{T} \gamma^{i-t} r_i^g$
            end
            update $\pi_\theta$, $v_\theta$ according to Equation 4
        end
    end
Open Source Code No The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets No The paper describes custom simulated environments (Bit Flip, 2D Navigation, 2D Maze, Quantum Compiler) where data is generated through interaction. It does not refer to or provide access to a specific publicly available dataset used for training.
Dataset Splits No The paper does not provide specific dataset split information (e.g., percentages, sample counts, or citations to predefined splits) for training, validation, or testing, as it operates within dynamically interacting environments rather than fixed datasets.
Hardware Specification No The paper states 'We ran each experiment in a single multi-core machine, with no GPUs.' This is not specific enough to identify exact CPU models, processor types, or memory details.
Software Dependencies No The paper mentions 'stable-baselines' and 'hyperopt' but does not specify their version numbers. No other software dependencies are mentioned with version numbers.
Experiment Setup Yes In this section, we provide the hyper-parameters employed in the experiments presented in this work. Table 1 and Table 2 provide a list of hyperparameters employed for both Alpha Zero and Alpha Zero HER, without being optimized.

Table 1:
    Hyperparameter        Value
    Optimizer             Adam
    $c_{uct}$             2.0
    Discount factor       0.999
    Episodes per epoch    50

Table 2:
    Hyperparameter       Environment          Value
    Learning rate        Bit Flip             0.0005
    Learning rate        2D Navigation        0.001
    Learning rate        2D Maze              0.0005
    Learning rate        Quantum Compiling    0.00005
    Batch size           Bit Flip             256
    Batch size           2D Navigation        512
    Batch size           2D Maze              512
    Batch size           Quantum Compiling    512
    Search Iterations    Bit Flip             20
    Search Iterations    2D Navigation        70
    Search Iterations    2D Maze              120
    Search Iterations    Quantum Compiling    20
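
The goal-relabeling step of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation (no source code was released): the function names, the Bit Flip-style reward (0 when the action reaches the goal, -1 otherwise), and the value chosen for k are all assumptions made for illustration. Only the discount factor of 0.999 comes from Table 1, and MCTS is replaced by placeholder policy tuples.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

State = Tuple[int, ...]    # e.g. a bit string in a Bit Flip-like task
Policy = Tuple[float, ...] # stands in for the MCTS policy p_t
Sample = Tuple[State, State, Policy, float]  # (s_t, goal, p_t, z_t^g)

GAMMA = 0.999   # discount factor reported in Table 1
K_GOALS = 2     # assumption: k is not reported in this summary


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """z_t = sum_{i=t}^{T} gamma^(i-t) * r_i for every t, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def bit_flip_reward(state: State, action: int, goal: State) -> float:
    """Hypothetical sparse reward r(s, a, g): 0 if flipping bit `action` hits the goal."""
    next_state = tuple(b ^ 1 if i == action else b for i, b in enumerate(state))
    return 0.0 if next_state == goal else -1.0


def hindsight_relabel(
    states: List[State],        # s_0 ... s_T (one more entry than actions)
    actions: List[int],         # a_0 ... a_{T-1}
    policies: List[Policy],     # p_0 ... p_{T-1}
    reward_fn: Callable[[State, int, State], float],
    k: int = K_GOALS,
    gamma: float = GAMMA,
    rng: Optional[random.Random] = None,
) -> List[Sample]:
    """For each step t, sample up to k goals from later states and recompute returns."""
    rng = rng or random.Random()
    samples: List[Sample] = []
    horizon = len(actions)
    for t in range(horizon):
        future_states = states[t + 1:]  # visited states s_j with j > t
        goals = rng.sample(future_states, min(k, len(future_states)))
        for goal in goals:
            # Relabel every reward from step t to the end of the episode.
            rewards = [reward_fn(states[i], actions[i], goal) for i in range(t, horizon)]
            z = discounted_returns(rewards, gamma)[0]
            samples.append((states[t], goal, policies[t], z))
    return samples


if __name__ == "__main__":
    # Two-bit toy episode: flip bit 0, then bit 1.
    demo_states = [(0, 0), (1, 0), (1, 1)]
    demo_actions = [0, 1]
    demo_policies = [(1.0, 0.0), (0.0, 1.0)]
    for sample in hindsight_relabel(demo_states, demo_actions, demo_policies, bit_flip_reward):
        print(sample)
```

The sketch follows the pseudocode literally and sums relabeled rewards to the end of the episode; truncating the sum once the substituted goal is reached would be a different design choice that the pseudocode does not specify.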