Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

AutoCGP: Closed-Loop Concept-Guided Policies from Unlabeled Demonstrations

Authors: Pei Zhou, Ruizhe Liu, Qian Luo, Fan Wang, Yibing Song, Yanchao Yang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments demonstrate that our approach significantly outperforms baseline methods across a range of tasks and environments, while showcasing emergent consistency in motion patterns associated with the discovered manipulation concepts.
Researcher Affiliation Collaboration (1) HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong; (2) Department of Electrical and Electronic Engineering, The University of Hong Kong; (3) DAMO Academy, Alibaba Group; (4) Hupan Lab; (5) Transcengram
Pseudocode Yes Algorithm 1: Automatic Concept Discovery
Input: demonstrations D = {τ = (s_t^τ, o_t^τ, a_t^τ)_{t=1}^{T(τ)}}
Modules: encoder E; codebook A = {α_k}_{k=1}^K with index set K = {k}_{k=1}^K; goal-state detection G; goal-state evaluation V(HN), Π; goal consolidation R
Output: trained state encoder φ
for each training iteration do
    sample τ ~ D
    for t = 1, 2, ..., T(τ) do
        z_t^τ = E(t | τ; Θ_E)                        ▷ VQ-VAE encoder
        k_t^τ = argmin_{k ∈ K} ‖z_t^τ − α_k‖        ▷ select manipulation concept
        α_t^τ = α_{k_t^τ}
        α_t^τ = SG(α_t^τ − z_t^τ) + z_t^τ            ▷ straight-through: preserve gradient
    end for
    for t = 1, 2, ..., T(τ) do
        compute g_t^τ using Eq. 4
    end for
    compute L_gd, L_a^ge, L_c^ge, L_gc               ▷ Eq. 3, Eq. 6, Eq. 7, Eq. 8
    compute L_vq                                     ▷ VQ and commitment losses (Van Den Oord et al., 2017)
    L_ACD = L_gd + L_c^ge + λ_ent (L_a^ge + L_vq) + λ_gc L_gc
    backpropagate from L_ACD
end for
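The vector-quantization step in Algorithm 1 (nearest-codebook lookup, with the straight-through trick used during training) can be sketched as below. This is a minimal illustration, not the authors' implementation; the function and variable names are ours.

```python
import numpy as np

def select_concept(z, codebook):
    """Pick the nearest codebook entry (manipulation concept) for each
    encoded state z_t, as in the VQ step of Algorithm 1.

    z:        (T, d) array of per-timestep state embeddings z_t.
    codebook: (K, d) array of concept embeddings {alpha_k}.
    Returns the selected indices k_t and the quantized embeddings alpha_t.
    """
    # k_t = argmin_k || z_t - alpha_k ||  (nearest-neighbour lookup)
    dists = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    idx = dists.argmin(axis=1)
    quantized = codebook[idx]
    # During training, gradients bypass the non-differentiable argmin via
    # the straight-through estimator alpha_t = SG(alpha_t - z_t) + z_t;
    # numerically the forward value equals `quantized`.
    return idx, quantized
```

The straight-through line matters only for backpropagation: the forward pass returns the codebook vector, while the gradient flows to the encoder output `z` unchanged.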
Open Source Code Yes Code is available at: https://github.com/PeiZhou26/AutoCGP.
Open Datasets Yes We evaluate our method on tabletop manipulation tasks as described in MimicGen (Mandlekar et al., 2023), detailed in Sec. C.1.
Dataset Splits No For each task and its corresponding level of variation, we select 950 demonstrations provided by MimicGen for imitation learning. ... During evaluations, the environment is initialized randomly. We conduct tests over 50 episodes for each task using a set of fixed random seeds for fair comparison. The paper does not explicitly define training, validation, and test splits from the 950 demonstrations or how the 50 evaluation episodes are initialized.
Hardware Specification Yes The training process can be finished on a single GeForce RTX 3090 in 1.5 days. ... The training process can be completed in less than one day on a single GeForce RTX 4090 GPU.
Software Dependencies No The paper mentions various architectures (VQ-VAE, Transformer, U-Net, ResNet-18) and optimizers (AdamW) but does not specify version numbers for programming languages or software libraries used for implementation (e.g., Python, PyTorch, TensorFlow).
Experiment Setup Yes All transformers used in our Concept Discovery Module follow the structure of the transformers in (Brown et al., 2020) and have an inner embedding dimension of 128 with 8 heads. The network E in Eq. 1 contains E (Eq. 2), which is a 4-layer transformer, and a VQ-VAE with a 30-item codebook as A. The model G in Eq. 3 is a 2-layer transformer. The hyper-network HN in Eq. 16 generates a feed-forward linear network with 2 hidden layers to form V in Eq. 5. The Π used in Eq. 7 is a 1-layer transformer; the R used in Eq. 8 is a 4-layer transformer. We employ the AdamW optimizer, coupled with a warm-up cosine annealing scheduler to modulate the learning rate. This scheduler initiates at 0.1× the base learning rate, linearly increases the rate to the base level over the course of 1000 epochs, and subsequently reduces the learning rate to 0.1× the base rate following a cosine function. The weight decay is always 1.0 × 10⁻³. We pad all input sequences to a length of 440 and use a batch size of 16 during training. We train our model for 4000 epochs with a base learning rate of 1.0 × 10⁻⁴. The loss terms in Eq. 3 and Eq. 7 receive a weight of 1.0; the loss terms in Eq. 6 and Eq. 8 receive a weight of 1.0 × 10⁻³. For training, each task (along with its respective levels of variation) uses 950 randomly sampled demonstrations (see also Sec. 4). The policy network maintains an input observation length of 4 and an action prediction horizon of 8, with a batch size of 150. To minimize overfitting, we employ the AdamW optimizer, incorporating a linear warm-up of the learning rate. The initial learning rate is set at 1 × 10⁻⁴ and gradually decreases following a cosine annealing schedule as the number of iterations progresses. Training consists of 100 epochs in total. Additionally, we apply a weight decay of 1 × 10⁻⁶ to improve model generalization.
For diffusion model training, we employ the `squaredcos_cap_v2` beta schedule with a beta range of 1e-4 to 2e-2, which smoothly and controllably adjusts how noise is introduced during the generation process.
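`squaredcos_cap_v2` is the name used in Hugging Face diffusers for the cosine noise schedule of Nichol and Dhariwal (2021). A minimal sketch follows; note that in diffusers this schedule is computed from the cosine alpha-bar function alone, and a quoted beta_start/beta_end range is typically not used by it, so how the 1e-4 to 2e-2 range enters is unclear from the report and is omitted here.

```python
import math

def squaredcos_cap_v2_betas(num_steps=100, max_beta=0.999):
    """Cosine beta schedule: alpha_bar(t) = cos^2(((t + 0.008) / 1.008) * pi/2),
    beta_i = 1 - alpha_bar(t_{i+1}) / alpha_bar(t_i), capped at max_beta."""
    def alpha_bar(t):
        return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2
    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        # Small betas early, growing toward the cap at the end of the chain.
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas
```

The cap keeps the final betas bounded where the cosine reaches zero, avoiding a degenerate last diffusion step.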