Continual Task Allocation in Meta-Policy Network via Sparse Prompting
Authors: Yijun Yang, Tianyi Zhou, Jing Jiang, Guodong Long, Yuhui Shi
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, CoTASP achieves a promising plasticity-stability trade-off without storing or replaying any past tasks' experiences. It outperforms existing continual and multi-task RL methods on all seen tasks, forgetting reduction, and generalization to unseen tasks. Our code is available at https://github.com/stevenyangyj/CoTASP |
| Researcher Affiliation | Academia | ¹Southern University of Science and Technology, ²University of Technology Sydney, ³University of Maryland, College Park. |
| Pseudocode | Yes | Algorithm 1 (Dictionary Learning): 1: input: $D^{(l)}$ for hidden layer $l$, $A^{(l)} = [a^{(l)}_1, \ldots, a^{(l)}_k] \in \mathbb{R}^{k \times k} = \sum_{i=1}^{t} \alpha^{(l)}_i \alpha^{(l)\top}_i$, $B^{(l)} = [b^{(l)}_1, \ldots, b^{(l)}_k] \in \mathbb{R}^{m \times k} = \sum_{i=1}^{t} e_i \alpha^{(l)\top}_i$, and constant $c$; 2: while not converged do; 3: for $j = 1$ to $k$ do; 4: $z = \frac{1}{A^{(l)}_{jj}} \big(b^{(l)}_j - D^{(l)} a^{(l)}_j\big) + D^{(l)}[j]$; 5: $D^{(l)}[j] = \min\{c / \lVert z \rVert_2, 1\}\, z$ ($\ell_2$-norm constraint); 6: output: updated $D^{(l)}$. Algorithm 2 (Training Procedure of CoTASP): 1: initialize: replay buffer $\mathcal{B} = \emptyset$, meta-policy network $\pi_\theta$ with $L$ layers, critic $Q$, dictionaries $\{D^{(l)}_0\}_{l=1}^{L-1}$, $A^{(l)}_0, B^{(l)}_0, \hat{\phi}^{(l)}_0 \leftarrow 0, 0, 0$, and constant $c$ for Alg. 1; 2: input: training budgets $I_\theta$, $I_\alpha$, and step function $\sigma(\cdot)$; 3: for $t = 1$ to $T$ do; 4: $e_t = f_{\text{S-BERT}}(\text{textual description of task } t)$; 5: initialize $\{\alpha^{(l)}_t\}_{l=1}^{L-1}$ by solving Eq. 3; 6: extract task-specific $\pi$ by Eq. 2 with $\{\sigma(\alpha^{(l)}_t)\}_{l=1}^{L-1}$; 7: for each iteration do (learning task $t$ with SAC); 8: for $i = 1$ to $I_\theta$ do (optimizing $\theta$); 9: collect $\tau = \{s_t, a_t, r_t, s'_t\}$ with $\pi$; 10: update $\mathcal{B}$ and sample a mini-batch $\tau$; 11: gradient descent on $Q$; 12: update $\theta$ by Eq. 4 with $\{\hat{\phi}^{(l)}_{t-1}\}_{l=1}^{L-1}$; 13: for $i = 1$ to $I_\alpha$ do (optimizing $\alpha$); 14: collect $\tau = \{s_t, a_t, r_t, s'_t\}$ with $\pi$; 15: update $\mathcal{B}$ and sample a mini-batch $\tau$; 16: gradient descent on $Q$; 17: gradient descent on $\{\alpha^{(l)}_t\}_{l=1}^{L-1}$ by STE; 18: for $l = 1$ to $L-1$ do (dictionary learning); 19: $\hat{\phi}^{(l)}_t \leftarrow \hat{\phi}^{(l)}_{t-1} \vee \sigma(\alpha^{(l)}_t)$; 20: $A^{(l)}_t \leftarrow A^{(l)}_{t-1} + \alpha^{(l)}_t \alpha^{(l)\top}_t$; 21: $B^{(l)}_t \leftarrow B^{(l)}_{t-1} + e_t \alpha^{(l)\top}_t$; 22: get updated $D^{(l)}_t$ by Alg. 1 with $D^{(l)}_{t-1}$; 23: output: $\theta$ and $\{D^{(l)}\}_{l=1}^{L-1}$. (See the Python sketches after the table for the dictionary update and the sparse-prompt initialization.) |
| Open Source Code | Yes | Our code is available at https://github.com/stevenyangyj/CoTASP |
| Open Datasets | Yes | To evaluate CoTASP, we follow the same settings as prior work (Wolczyk et al., 2022) and perform thorough experiments. Specifically, we primarily use CW10, a benchmark in the Continual World (CW) (Wolczyk et al., 2021), which consists of 10 representative manipulation tasks from Meta-World (Yu et al., 2019). (A minimal Meta-World loading sketch appears after the table.) |
| Dataset Splits | No | Note that we stop the training when the success rate in two consecutive evaluations reaches the threshold (set to 0.9). (This describes a stopping criterion based on performance, not a specific train/validation/test split for the dataset itself; a minimal sketch of the stopping check appears after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or cloud instance types) used for running its experiments. It only describes the model architecture (e.g., MLPs with certain hidden layers and neurons). |
| Software Dependencies | No | We carefully tune the hyperparameters for a JAX implementation of the SAC algorithm (Bradbury et al., 2018; Kostrikov, 2021), and they are common for all baseline methods. (This mentions JAX and SAC but does not provide specific version numbers for them or any other software dependencies.) |
| Experiment Setup | Yes | The actor and the critic are implemented as two separate multi-layer perceptron (MLP) networks, each with 4 hidden layers of 256 neurons. For structure-based methods (PackNet, HAT) and our proposed CoTASP, a wider MLP network with 1024 neurons per layer is used as the actor. We refer to these hidden layers as the backbone and the last output layer as the head. Unlike other continual RL methods (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019; Serrà et al., 2018; Kessler et al., 2022), which rely on a separate head for each new task, CoTASP uses a single-head setting where only one head is used for all tasks. In this case, CoTASP does not require selecting the appropriate head for each task and enables the reuse of parameters between similar tasks. According to (Wolczyk et al., 2021), regularizing the critic often leads to a decline in performance. Therefore, we completely ignore the forgetting issue in the critic network and retrain it for each new task. More details on the hyperparameters used in training can be found in Appendix D. Table 4: Hyperparameters of CoTASP for Continual World experiments. (A minimal Flax sketch of this masked single-head actor appears after the table.) |
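
The Pseudocode row above quotes Algorithm 1 (an online dictionary-learning update in the style of Mairal et al.) together with the per-task statistics updates in Algorithm 2, lines 20-21. The NumPy sketch below is not the authors' implementation; it only restates those quoted steps, with `n_iters` standing in for the unspecified convergence check.

```python
import numpy as np

def dictionary_update(D, A, B, c=1.0, n_iters=10, eps=1e-12):
    """Block-coordinate update of one layer's dictionary (Algorithm 1 as quoted).

    D: (m, k) dictionary for hidden layer l
    A: (k, k) accumulated sum of alpha_i alpha_i^T over tasks 1..t
    B: (m, k) accumulated sum of e_i alpha_i^T over tasks 1..t
    c: radius of the l2-norm constraint on each dictionary column
    """
    _, k = D.shape
    for _ in range(n_iters):                      # "while until convergence"
        for j in range(k):
            if A[j, j] < eps:                     # column never used; nothing to update
                continue
            # line 4: z = (1 / A_jj) * (b_j - D a_j) + d_j
            z = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
            # line 5: project d_j back onto the l2 ball of radius c
            D[:, j] = min(c / (np.linalg.norm(z) + eps), 1.0) * z
    return D

def accumulate_statistics(A, B, alpha_t, e_t):
    """Algorithm 2, lines 20-21: rank-one updates after finishing task t."""
    A = A + np.outer(alpha_t, alpha_t)            # A_t = A_{t-1} + alpha_t alpha_t^T
    B = B + np.outer(e_t, alpha_t)                # B_t = B_{t-1} + e_t alpha_t^T
    return A, B
```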
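
Lines 4-5 of Algorithm 2 embed the task description with Sentence-BERT and initialize the sparse prompts by solving Eq. 3, which is not quoted above. The sketch below assumes Eq. 3 is a standard lasso (sparse-coding) problem over the layer dictionary and uses scikit-learn's `sparse_encode` plus the `sentence-transformers` package; the model name, the example task description, the sparsity level, and the random dictionary are all placeholders, not values confirmed by the excerpt.

```python
import numpy as np
from sentence_transformers import SentenceTransformer   # assumed S-BERT wrapper
from sklearn.decomposition import sparse_encode         # assumed solver for Eq. 3

# e_t = f_S-BERT(textual description of task t)  -- Algorithm 2, line 4
encoder = SentenceTransformer("all-MiniLM-L6-v2")        # placeholder model choice
e_t = encoder.encode("push the puck to the goal")        # illustrative description
e_t = e_t / (np.linalg.norm(e_t) + 1e-12)

# Initialize alpha_t^(l) for one hidden layer -- Algorithm 2, line 5 (assumed lasso form)
m, k = e_t.shape[0], 1024                                # k = layer width (1024 in the paper)
rng = np.random.default_rng(0)
D_l = rng.normal(size=(m, k))                            # stand-in dictionary
D_l /= np.linalg.norm(D_l, axis=0, keepdims=True)        # unit-norm columns

alpha_t = sparse_encode(
    e_t[None, :],            # (1, m) signal to encode
    D_l.T,                   # (k, m) dictionary atoms as rows
    algorithm="lasso_lars",
    alpha=0.1,               # hypothetical sparsity level
)[0]                         # (k,) sparse prompt for this layer

# One plausible reading of the step function sigma(.) used in Algorithm 2.
mask = (alpha_t > 0).astype(np.float32)
```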
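
The Open Datasets row points to CW10, a 10-task sequence of Meta-World manipulation tasks curated by Continual World. The exact CW10 task list is defined in the Continual World repository and is not quoted above; the sketch below only shows the documented way to instantiate Meta-World tasks via the `metaworld` package, using MT10 as a stand-in.

```python
import random
import metaworld

# MT10 is used here only as a stand-in; CW10 is a different curated 10-task
# sequence defined by Continual World (Wolczyk et al., 2021).
benchmark = metaworld.MT10()

envs = []
for name, env_cls in benchmark.train_classes.items():
    env = env_cls()
    task = random.choice([t for t in benchmark.train_tasks if t.env_name == name])
    env.set_task(task)
    envs.append(env)

obs = envs[0].reset()
action = envs[0].action_space.sample()
```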
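
The Dataset Splits row quotes a per-task stopping rule rather than a data split: training on a task halts once the success rate reaches 0.9 in two consecutive evaluations. A minimal check implementing that rule as stated:

```python
def should_stop(success_rates, threshold=0.9, consecutive=2):
    """Return True once the last `consecutive` evaluations all reach `threshold`."""
    if len(success_rates) < consecutive:
        return False
    return all(rate >= threshold for rate in success_rates[-consecutive:])

# Example: stops after the fourth evaluation.
history = [0.4, 0.85, 0.92, 0.95]
assert not should_stop(history[:3])
assert should_stop(history)
```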
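
The Experiment Setup row describes the actor as an MLP with 4 hidden layers (1024 units each for the structure-based methods and CoTASP) and a single output head shared across all tasks. The Flax sketch below is only one reading of that description: it assumes the task-specific masks σ(α^(l)) gate the hidden activations of each layer, and it omits the SAC-style mean/log-std outputs, neither of which is specified in the excerpt. The observation and action dimensions are placeholders.

```python
from typing import Optional, Sequence

import flax.linen as nn
import jax
import jax.numpy as jnp

class SingleHeadMaskedActor(nn.Module):
    """4 x 1024 hidden layers with one output head shared across all tasks."""
    action_dim: int
    hidden_dims: Sequence[int] = (1024, 1024, 1024, 1024)

    @nn.compact
    def __call__(self, obs: jnp.ndarray, masks: Optional[Sequence[jnp.ndarray]] = None):
        x = obs
        for l, width in enumerate(self.hidden_dims):
            x = nn.relu(nn.Dense(width)(x))
            if masks is not None:
                x = x * masks[l]          # assumed: sigma(alpha^(l)) gates layer-l activations
        return nn.Dense(self.action_dim)(x)  # single head, no per-task output layers

# Shape check with random inputs (observation/action dims are placeholders).
model = SingleHeadMaskedActor(action_dim=4)
params = model.init(jax.random.PRNGKey(0), jnp.zeros((1, 39)))
out = model.apply(params, jnp.zeros((1, 39)))
```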