Continual Task Allocation in Meta-Policy Network via Sparse Prompting
Authors: Yijun Yang, Tianyi Zhou, Jing Jiang, Guodong Long, Yuhui Shi
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, CoTASP achieves a promising plasticity-stability trade-off without storing or replaying any past tasks' experiences. It outperforms existing continual and multi-task RL methods on all seen tasks, forgetting reduction, and generalization to unseen tasks. Our code is available at https://github.com/stevenyangyj/CoTASP |
| Researcher Affiliation | Academia | ¹Southern University of Science and Technology, ²University of Technology Sydney, ³University of Maryland, College Park. |
| Pseudocode | Yes | Algorithm 1 (Dictionary Learning): 1: input: $D^{(l)}$ for hidden layer $l$, $A^{(l)} = [a^{(l)}_1, \ldots, a^{(l)}_k] \in \mathbb{R}^{k \times k} = \sum_{i=1}^{t} \alpha^{(l)}_i \alpha^{(l)\top}_i$, $B^{(l)} = [b^{(l)}_1, \ldots, b^{(l)}_k] \in \mathbb{R}^{m \times k} = \sum_{i=1}^{t} e_i \alpha^{(l)\top}_i$, and constant $c$; 2: while not converged do; 3: for $j = 1$ to $k$ do; 4: $z = \frac{1}{A^{(l)}_{jj}} \big(b^{(l)}_j - D^{(l)} a^{(l)}_j\big) + D^{(l)}[j]$; 5: $D^{(l)}[j] = \min\{c / \lVert z \rVert_2, 1\}\, z$ ($\ell_2$-norm constraint); 6: output: updated $D^{(l)}$. Algorithm 2 (Training Procedure of CoTASP): 1: initialize: replay buffer $\mathcal{B} = \emptyset$, meta-policy network $\pi_\theta$ with $L$ layers, critic $Q$, dictionaries $\{D^{(l)}_0\}_{l=1}^{L-1}$, $A^{(l)}_0, B^{(l)}_0, \hat{\phi}^{(l)}_0 \leftarrow 0, 0, 0$, and constant $c$ for Alg. 1; 2: input: training budgets $I_\theta$, $I_\alpha$, and step function $\sigma(\cdot)$; 3: for $t = 1$ to $T$ do; 4: $e_t = f_{\text{S-BERT}}(\text{textual description of task } t)$; 5: initialize $\{\alpha^{(l)}_t\}_{l=1}^{L-1}$ by solving Eq. 3; 6: extract task-specific $\pi$ by Eq. 2 with $\{\sigma(\alpha^{(l)}_t)\}_{l=1}^{L-1}$; 7: for each iteration do (learning task $t$ with SAC); 8: for $i = 1$ to $I_\theta$ do (optimizing $\theta$); 9: collect $\tau = \{s_t, a_t, r_t, s'_t\}$ with $\pi$; 10: update $\mathcal{B}$ and sample a mini-batch $\tau$; 11: gradient descent on $Q$; 12: update $\theta$ by Eq. 4 with $\{\hat{\phi}^{(l)}_{t-1}\}_{l=1}^{L-1}$; 13: for $i = 1$ to $I_\alpha$ do (optimizing $\alpha$); 14: collect $\tau = \{s_t, a_t, r_t, s'_t\}$ with $\pi$; 15: update $\mathcal{B}$ and sample a mini-batch $\tau$; 16: gradient descent on $Q$; 17: gradient descent on $\{\alpha^{(l)}_t\}_{l=1}^{L-1}$ by STE; 18: for $l = 1$ to $L-1$ do (dictionary learning); 19: $\hat{\phi}^{(l)}_t \leftarrow \hat{\phi}^{(l)}_{t-1} \vee \sigma(\alpha^{(l)}_t)$; 20: $A^{(l)}_t \leftarrow A^{(l)}_{t-1} + \alpha^{(l)}_t \alpha^{(l)\top}_t$; 21: $B^{(l)}_t \leftarrow B^{(l)}_{t-1} + e_t \alpha^{(l)\top}_t$; 22: get updated $D^{(l)}_t$ by Alg. 1 with $D^{(l)}_{t-1}$; 23: output: $\theta$ and $\{D^{(l)}\}_{l=1}^{L-1}$. (See the Python sketches after the table for the dictionary update and the sparse-prompt initialization.) |
| Open Source Code | Yes | Our code is available at https://github.com/stevenyangyj/CoTASP |
| Open Datasets | Yes | To evaluate CoTASP, we follow the same settings as prior work (Wolczyk et al., 2022) and perform thorough experiments. Specifically, we primarily use CW10, a benchmark in the Continual World (CW) (Wolczyk et al., 2021), which consists of 10 representative manipulation tasks from Meta-World (Yu et al., 2019). (A minimal Meta-World loading sketch appears after the table.) |
| Dataset Splits | No | Note that we stop the training when the success rate in two consecutive evaluations reaches the threshold (set to 0.9). (This describes a stopping criterion based on performance, not a specific train/validation/test split for the dataset itself; a minimal sketch of the stopping check appears after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or cloud instance types) used for running its experiments. It only describes the model architecture (e.g., MLPs with certain hidden layers and neurons). |
| Software Dependencies | No | We carefully tune the hyperparameters for a JAX implementation of the SAC algorithm (Bradbury et al., 2018; Kostrikov, 2021), and they are common for all baseline methods. (This mentions JAX and SAC but does not provide specific version numbers for them or any other software dependencies.) |
| Experiment Setup | Yes | The actor and the critic are implemented as two separate multi-layer perceptron (MLP) networks, each with 4 hidden layers of 256 neurons. For structure-based methods (PackNet, HAT) and our proposed CoTASP, a wider MLP network with 1024 neurons per layer is used as the actor. We refer to these hidden layers as the backbone and the last output layer as the head. Unlike other continual RL methods (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019; Serrà et al., 2018; Kessler et al., 2022), which rely on a separate head for each new task, CoTASP uses a single-head setting where only one head is used for all tasks. In this case, CoTASP does not require selecting the appropriate head for each task and enables the reuse of parameters between similar tasks. According to (Wolczyk et al., 2021), regularizing the critic often leads to a decline in performance. Therefore, we completely ignore the forgetting issue in the critic network and retrain it for each new task. More details on the hyperparameters used in training can be found in Appendix D. Table 4: Hyperparameters of CoTASP for Continual World experiments. (A minimal Flax sketch of this masked single-head actor appears after the table.) |
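
The Pseudocode row above quotes Algorithm 1 (an online dictionary-learning update in the style of Mairal et al.) together with the per-task statistics updates in Algorithm 2, lines 20-21. The NumPy sketch below is not the authors' implementation; it only restates those quoted steps, with `n_iters` standing in for the unspecified convergence check.

```python
import numpy as np

def dictionary_update(D, A, B, c=1.0, n_iters=10, eps=1e-12):
    """Block-coordinate update of one layer's dictionary (Algorithm 1 as quoted).

    D: (m, k) dictionary for hidden layer l
    A: (k, k) accumulated sum of alpha_i alpha_i^T over tasks 1..t
    B: (m, k) accumulated sum of e_i alpha_i^T over tasks 1..t
    c: radius of the l2-norm constraint on each dictionary column
    """
    _, k = D.shape
    for _ in range(n_iters):                      # "while until convergence"
        for j in range(k):
            if A[j, j] < eps:                     # column never used; nothing to update
                continue
            # line 4: z = (1 / A_jj) * (b_j - D a_j) + d_j
            z = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
            # line 5: project d_j back onto the l2 ball of radius c
            D[:, j] = min(c / (np.linalg.norm(z) + eps), 1.0) * z
    return D

def accumulate_statistics(A, B, alpha_t, e_t):
    """Algorithm 2, lines 20-21: rank-one updates after finishing task t."""
    A = A + np.outer(alpha_t, alpha_t)            # A_t = A_{t-1} + alpha_t alpha_t^T
    B = B + np.outer(e_t, alpha_t)                # B_t = B_{t-1} + e_t alpha_t^T
    return A, B
```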
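
Lines 4-5 of Algorithm 2 embed the task description with Sentence-BERT and initialize the sparse prompts by solving Eq. 3, which is not quoted above. The sketch below assumes Eq. 3 is a standard lasso (sparse-coding) problem over the layer dictionary and uses scikit-learn's `sparse_encode` plus the `sentence-transformers` package; the model name, the example task description, the sparsity level, and the random dictionary are all placeholders, not values confirmed by the excerpt.

```python
import numpy as np
from sentence_transformers import SentenceTransformer   # assumed S-BERT wrapper
from sklearn.decomposition import sparse_encode         # assumed solver for Eq. 3

# e_t = f_S-BERT(textual description of task t)  -- Algorithm 2, line 4
encoder = SentenceTransformer("all-MiniLM-L6-v2")        # placeholder model choice
e_t = encoder.encode("push the puck to the goal")        # illustrative description
e_t = e_t / (np.linalg.norm(e_t) + 1e-12)

# Initialize alpha_t^(l) for one hidden layer -- Algorithm 2, line 5 (assumed lasso form)
m, k = e_t.shape[0], 1024                                # k = layer width (1024 in the paper)
rng = np.random.default_rng(0)
D_l = rng.normal(size=(m, k))                            # stand-in dictionary
D_l /= np.linalg.norm(D_l, axis=0, keepdims=True)        # unit-norm columns

alpha_t = sparse_encode(
    e_t[None, :],            # (1, m) signal to encode
    D_l.T,                   # (k, m) dictionary atoms as rows
    algorithm="lasso_lars",
    alpha=0.1,               # hypothetical sparsity level
)[0]                         # (k,) sparse prompt for this layer

# One plausible reading of the step function sigma(.) used in Algorithm 2.
mask = (alpha_t > 0).astype(np.float32)
```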
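
The Open Datasets row points to CW10, a 10-task sequence of Meta-World manipulation tasks curated by Continual World. The exact CW10 task list is defined in the Continual World repository and is not quoted above; the sketch below only shows the documented way to instantiate Meta-World tasks via the `metaworld` package, using MT10 as a stand-in.

```python
import random
import metaworld

# MT10 is used here only as a stand-in; CW10 is a different curated 10-task
# sequence defined by Continual World (Wolczyk et al., 2021).
benchmark = metaworld.MT10()

envs = []
for name, env_cls in benchmark.train_classes.items():
    env = env_cls()
    task = random.choice([t for t in benchmark.train_tasks if t.env_name == name])
    env.set_task(task)
    envs.append(env)

obs = envs[0].reset()
action = envs[0].action_space.sample()
```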
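
The Dataset Splits row quotes a per-task stopping rule rather than a data split: training on a task halts once the success rate reaches 0.9 in two consecutive evaluations. A minimal check implementing that rule as stated:

```python
def should_stop(success_rates, threshold=0.9, consecutive=2):
    """Return True once the last `consecutive` evaluations all reach `threshold`."""
    if len(success_rates) < consecutive:
        return False
    return all(rate >= threshold for rate in success_rates[-consecutive:])

# Example: stops after the fourth evaluation.
history = [0.4, 0.85, 0.92, 0.95]
assert not should_stop(history[:3])
assert should_stop(history)
```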
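
The Experiment Setup row describes the actor as an MLP with 4 hidden layers (1024 units each for the structure-based methods and CoTASP) and a single output head shared across all tasks. The Flax sketch below is only one reading of that description: it assumes the task-specific masks σ(α^(l)) gate the hidden activations of each layer, and it omits the SAC-style mean/log-std outputs, neither of which is specified in the excerpt. The observation and action dimensions are placeholders.

```python
from typing import Optional, Sequence

import flax.linen as nn
import jax
import jax.numpy as jnp

class SingleHeadMaskedActor(nn.Module):
    """4 x 1024 hidden layers with one output head shared across all tasks."""
    action_dim: int
    hidden_dims: Sequence[int] = (1024, 1024, 1024, 1024)

    @nn.compact
    def __call__(self, obs: jnp.ndarray, masks: Optional[Sequence[jnp.ndarray]] = None):
        x = obs
        for l, width in enumerate(self.hidden_dims):
            x = nn.relu(nn.Dense(width)(x))
            if masks is not None:
                x = x * masks[l]          # assumed: sigma(alpha^(l)) gates layer-l activations
        return nn.Dense(self.action_dim)(x)  # single head, no per-task output layers

# Shape check with random inputs (observation/action dims are placeholders).
model = SingleHeadMaskedActor(action_dim=4)
params = model.init(jax.random.PRNGKey(0), jnp.zeros((1, 39)))
out = model.apply(params, jnp.zeros((1, 39)))
```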