Recurrent Deep Multiagent Q-Learning for Autonomous Brokers in Smart Grid
Authors: Yaodong Yang, Jianye Hao, Mingyang Sun, Zan Wang, Changjie Fan, Goran Strbac
IJCAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we first describe the tariff selection model for customers and other effective strategies. Afterward, we evaluate a DQN-based broker and a Q-table-based broker [Reddy and Veloso, 2011] in a simple setting to demonstrate the superior performance of DQN. SARSA is quite similar to Q-learning except that Q-learning is an off-policy learning algorithm while SARSA is an on-policy one, and is thus not considered for evaluation here. Then we evaluate our RDMRL broker with reward shaping and compare it against a single-agent broker based on recurrent DQN and an RDMRL broker without reward shaping. The experiments demonstrate the superior performance of the proposed pricing strategy and highlight the effectiveness of our reward shaping mechanism. |
| Researcher Affiliation | Collaboration | Yaodong Yang¹, Jianye Hao¹, Mingyang Sun², Zan Wang¹, Changjie Fan³ and Goran Strbac². ¹School of Computer Software, Tianjin University; ²Imperial College London; ³NetEase, Inc. |
| Pseudocode | No | The complementary description of the recurrent deep Q-learning algorithm is omitted due to space limitation and can be found in an online appendix (https://goo.gl/HHBYdg). (The pseudocode is not present in the paper itself, only linked externally.) |
| Open Source Code | No | The complementary description of the recurrent deep Q-learning algorithm is omitted due to space limitation and can be found in an online appendix (https://goo.gl/HHBYdg). (The appendix provides a description of the algorithm, not source code.) |
| Open Datasets | Yes | To evaluate our broker framework, we introduce real household electricity load measurements from the city of London over the past three years to simulate the retail market. The raw data consists of power consumption records of 5,567 households that took part in the UK Power Networks led Low Carbon London project between November 2011 and February 2014 [Energy Consumption Data, 2015]. After cleaning records with missing values, 4,747 households remain. The running data is the household consumption data in the first week of 2013. [Energy Consumption Data, 2015] Electricity consumption in a sample of London households, 2015. https://data.london.gov.uk/dataset/smartmeter-energyuse-data-in-london-households. (A hedged data-preparation sketch based on this description appears below the table.) |
| Dataset Splits | No | To evaluate the learned strategy, we run 200 episodes for training and 100 episodes for evaluation. Training lasts for 200 episodes and the learned policy is evaluated for 100 episodes. (These refer to training and evaluation episodes in an RL setup, not explicit train/validation/test splits of a fixed dataset.) |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory) are mentioned in the paper for running experiments. |
| Software Dependencies | No | Our DQN is trained with RMSProp using a carefully selected learning rate of 0.0001, which yields good performance in our experiments. The ε-greedy algorithm is used in the action selection process, and ε decreases from 0.9 to 0 over the course of training. LSTM has shown excellent modeling power for sequential data and powerful discriminative abilities [Wen et al., 2015]. (No specific software packages or version numbers are provided.) |
| Experiment Setup | Yes | More specifically, we manually configure 1,000 consumers and 100 producers as follows. The load per consumer is 10 kWh, while the production per producer is 100 kWh, so supply and demand are balanced in aggregate. The number of time slots per episode is fixed at 240. To evaluate the learned strategy, we run 200 episodes for training and 100 episodes for evaluation. Furthermore, the customer selection probability distribution χ is set as {40, 30, 20, 10, 0} to encourage reasonable prices. The profit margin µL, the initial consumer price, and the initial producer price are set to $0.02, $0.12 and $0.08 respectively, following [Reddy and Veloso, 2011; Detailed State Data, 2010]. The network used here has only one ordinary hidden layer with 24 units. Our DQN is trained with RMSProp using a carefully selected learning rate of 0.0001, which yields good performance in our experiments. The numbers of units in the two hidden layers are both set to 24, and the output layer has six nodes, each outputting the Q-value of one action. The ε-greedy algorithm is used in the action selection process, and ε decreases from 0.9 to 0 over the course of training. Each recurrent DQN is trained with RMSProp at a learning rate of 0.0001, using the most recent six time steps of information, i.e., S = <P_t, U_t, R_t | t = 1, 2, ..., 6>. For the customer selection model, we configure the consumer initial expectation price range at [0.10, 0.15] and the producer's at [0.05, 0.10]. Training lasts for 200 episodes and the learned policy is evaluated for 100 episodes. Each episode spans 7 days. (A hedged configuration sketch based on these reported hyperparameters appears below the table.) |
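
As a companion to the dataset row above, here is a minimal data-preparation sketch under stated assumptions: the filename, the column names (`household_id`, `timestamp`, `kwh`), and the use of pandas are all placeholders, since the paper only reports that households with missing values were removed (5,567 reduced to 4,747) and that the first week of 2013 drives the simulation.

```python
import pandas as pd

# Hedged sketch of the data preparation implied by the paper: drop households
# with missing readings and keep the first week of 2013 as the running data.
# The filename and column names are hypothetical; the actual SmartMeter
# Energy Use files on data.london.gov.uk use their own headers.
df = pd.read_csv("london_smart_meter_readings.csv",  # hypothetical filename
                 parse_dates=["timestamp"])

# Keep only households whose readings contain no missing values
# (5,567 households before cleaning, 4,747 reported afterward).
complete = df.groupby("household_id")["kwh"].apply(lambda s: s.notna().all())
df = df[df["household_id"].isin(complete[complete].index)]

# Restrict to the first week of 2013, which is used to drive the simulation.
week1 = df[(df["timestamp"] >= "2013-01-01") & (df["timestamp"] < "2013-01-08")]
```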
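
The experiment setup row reports the network and training hyperparameters but not the implementation. The sketch below reassembles them in PyTorch purely for illustration; the framework, the input dimension, the linear ε schedule, and the recurrent variant's hidden size are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

# Sketch of the feed-forward Q-network described in the setup: two hidden
# layers of 24 units and an output layer with six nodes, one Q-value per
# pricing action. (The paper also mentions a single-hidden-layer variant.)
STATE_DIM = 3          # assumed input size, e.g. (P_t, U_t, R_t) for one slot
NUM_ACTIONS = 6        # six pricing actions, as stated in the paper

q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 24),
    nn.ReLU(),
    nn.Linear(24, 24),
    nn.ReLU(),
    nn.Linear(24, NUM_ACTIONS),
)

# RMSProp with the reported learning rate of 0.0001.
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-4)

def epsilon(episode: int, total_episodes: int = 200, eps_start: float = 0.9) -> float:
    """Epsilon decaying from 0.9 to 0 across the 200 training episodes.
    A linear schedule is an assumption; the paper only gives the endpoints."""
    return eps_start * max(0.0, 1.0 - episode / total_episodes)

def select_action(state: torch.Tensor, episode: int) -> int:
    """Epsilon-greedy selection over the six Q-values."""
    if torch.rand(1).item() < epsilon(episode):
        return torch.randint(NUM_ACTIONS, (1,)).item()
    with torch.no_grad():
        return q_net(state).argmax().item()

class RecurrentQNet(nn.Module):
    """Sketch of the recurrent variant: an LSTM over the most recent six
    (P_t, U_t, R_t) observations feeding a 6-way Q-value head. A hidden
    size of 24 mirrors the reported layer width and is an assumption."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=STATE_DIM, hidden_size=24, batch_first=True)
        self.head = nn.Linear(24, NUM_ACTIONS)

    def forward(self, seq):           # seq: (batch, 6, STATE_DIM)
        out, _ = self.lstm(seq)
        return self.head(out[:, -1])  # Q-values from the last time step
```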