Keywords—microgrid; reinforcement learning; distributed control; multi-agent system; optimization; communication system; load shaping.
I. Introduction
Currently, there is a trend in the electric power industry towards distributed and autonomous power systems. As a result, microgrids have attracted increased research interest. Microgrid operation could be driven by end-user or generation-side objectives. A decentralized and distributed scheme may have some benefits compared to the generation-centered approach observed in conventional grids. For example, a decentralized structure helps reduce the system operating cost and makes it easier to connect new customers [1, 2]. Also, an autonomous control scheme makes the system more flexible and resilient against external actions. Several pioneering microgrid projects have been developed and built around the world. Examples are the UCSD microgrid in San Diego, U.S., Tesla’s SolarCity microgrid on Ta’u in American Samoa, and the Sendai microgrid in Japan . Some microgrid projects have shown great resistance and resilience during and after natural disasters. After the 2011 earthquake in eastern Japan, the local power distribution company in the city of Sendai experienced a two-day loss of service. However, the Sendai microgrid, a project built at Tohoku Fukushi University, kept one of its neighboring hospitals operating until the power supply from the local power distribution company was restored .
The communication industry has very demanding service availability requirements. Due to the critical role played by communication networks in today’s society, a few companies have already explored using microgrids to power communication facilities, including microgrids integrated with a group of base stations  (see also https://ieeexplore.ieee.org/abstract/document/8211687/). Therefore, in the foreseeable future, more communication microgrid projects might be introduced to gradually replace traditional power systems. Unlike conventional power systems, these new grids are more likely to be powered by distributed and renewable power generation, which implies that there is no central power unit and no generation/load schedule is provided. Due to the stochastic nature of renewable power sources, there must be some form of energy buffer or storage device in the system to handle power mismatches. Hence, a practical question applicable to these systems is how users should make use of the energy storage devices and minimize their operating cost. In some extreme conditions, such as after a natural disaster, this problem could be critical: should a base station discharge its batteries to offer better communication service during the aftermath of such an event, or should battery charge be preserved as much as possible in order to provide service for a longer period at the expense of a lower quality of service? A few studies have considered the tradeoff between service quality and energy storage [7, 8]. In these studies, the microgrid is coordinated and controlled by a centralized controller which thoroughly monitors the microgrid. The central controller utilizes a Markov chain model to predict the state of charge (SoC) of the battery units and plans the energy consumption accordingly. Reference  provided another approach that models the microgrid operation as a multi-agent game.
Each agent in this game applies the indifference principle, a tool for solving for a Nash equilibrium (NE) in game theory, to obtain an operation strategy. In addition,  proposed an optimization solver for computing the NE under cooperative and noncooperative conditions. These approaches are proven to reach the optimal strategy or an NE, but they require an accurate model of the microgrid and its environment. However, in real-life applications, the communication infrastructure and environment information, such as load forecast, solar radiation, wind speed, and temperature, might be unavailable to the users or inaccurate. Therefore, this paper introduces a learning mechanism that allows multiple base stations in a microgrid to adjust their strategies based on their own observations and gradually adapt to their environment.
Reinforcement learning (RL) has been shown to achieve great performance in solving multi-agent optimization problems in an unknown environment . In some power system market studies, RL was applied to improve agents’ decision making and increase their expected payoff in the energy market [12-14]. For example, in , an RL algorithm was used to obtain a cost-effective day-ahead load plan for charging a fleet of electric vehicles. RL has also been applied in energy management systems to learn optimal load-generation planning considering energy cost, such as the work done in [15-17].
In this paper, an RL algorithm, Learning Automata (LA), is introduced to find an energy consumption plan for a communication microgrid considering communication service and stored energy level. This learning method allows agents to find the optimal policy that maximizes their payoff without much knowledge of the microgrid or environment.
II. Microgrid Energy Management
The communication microgrid discussed in this paper includes a set of physical entities: decentralized renewable generation systems, communication base stations, and energy storage devices (batteries). The batteries maintain the bus voltage by absorbing excess generated power or by powering the load when the generated power is insufficient. An example of this microgrid is shown in Fig. 1. It is also assumed that each base station in this microgrid has a controller so that it can act locally on its energy consumption. One method to manage energy consumption locally is communication traffic shaping, which manipulates the data processed and transmitted in a base station . In a general form, the power consumption at each base station could be expressed as

$$P_{total}(t) = P_b + P_c(\delta(t)) \quad (1)$$

where $P_b$ is the base power and $P_c$ is the controllable power as a linear function of the control parameter $\delta$. In this study, the control variable $\delta$ is the communication shaping factor, which reflects the quality of the communication. Generally, the higher the $\delta$, the better the communication quality. On the other hand, the power generated by renewables, such as photovoltaic (PV) modules and wind turbines, is partially stochastic. For example, the power generated by a solar panel depends on conditions such as solar irradiation, temperature, cloud shading, and many other factors  that are stochastic. Therefore, the power generation of such sources could be described in a stochastic form as

$$G_r(t) \sim X(\theta_1, \theta_2, \ldots, \theta_n) \quad (2)$$

where $G_r$ is the power generated by a renewable source at time $t$, which depends on the factors $\theta_1, \theta_2, \ldots, \theta_n$, and $X$ is the distribution that the power generation follows. Suppose the power difference between (2) and (1) is fulfilled by an energy storage device. Then, the battery instantaneous input power is

$$P_B(t) = G_r(t) - P_{total}(t) \quad (3)$$
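Since $G_r$ is a random variable, the battery input power also follows a distribution $Y$,

$$P_B(t) \sim Y \quad (4)$$

whose expectation and variance, for a given control variable $\delta(t)$, are

$$E[P_B(t)] = E[G_r(t)] - P_{total}(t) \quad (5)$$

$$Var[P_B(t)] = Var[G_r(t)] \quad (6)$$

Therefore, the energy stored in the battery at a future time $t$, measured relative to the level at the current time $k\Delta t$, could be computed by a time integral of (4)

$$E_B(t) = \int_{k\Delta t}^{t} P_B(\tau)\, d\tau \quad (7)$$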
Discretizing time $t$ into $n$ small intervals

$$t_i = i\,\Delta t, \quad i = k, k+1, \ldots, n \quad (8)$$

and assuming the power generation (2) is constant within each small period $\Delta t$, according to the Lyapunov central limit theorem, the stored energy follows a normal distribution

$$E_B(t) \sim N(\mu_B, \sigma_B^2) \quad (9)$$

where

$$\mu_B = \sum_{i=k}^{n} E[P_B(i\Delta t)]\,\Delta t, \quad \sigma_B^2 = \sum_{i=k}^{n} Var[P_B(i\Delta t)]\,\Delta t^2 \quad (10)$$

As (5), (6), and (10) show, the future stored energy is a random variable whose distribution depends on the renewable generation $G_r$ and the load control variable $\delta(t)$. Correspondingly, the state of stored energy could be predicted by (9).
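As a minimal sketch of how the Gaussian prediction in (9) and (10) could be evaluated numerically, the following Python snippet accumulates the per-interval mean and variance of the battery power and computes the probability that the stored energy reaches a goal. The function names and the inputs `mean_power`, `var_power`, and `dt` are hypothetical. The sketch assumes independent intervals, so each interval contributes $Var[P_B]\,\Delta t^2$ to the energy variance, and it measures energy relative to the level at the start of the horizon.

```python
import math


def stored_energy_distribution(mean_power, var_power, dt):
    """Return (mu_B, sigma_B) of the energy accumulated over the horizon.

    mean_power[i], var_power[i]: assumed E[P_B] and Var[P_B] in interval i.
    dt: interval length. Intervals are assumed independent (illustrative).
    """
    mu = sum(m * dt for m in mean_power)
    var = sum(v * dt ** 2 for v in var_power)
    return mu, math.sqrt(var)


def prob_energy_at_least(goal, mu, sigma):
    """P(E_B >= goal) when E_B ~ N(mu, sigma^2)."""
    if sigma == 0.0:
        return 1.0 if mu >= goal else 0.0
    z = (goal - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


if __name__ == "__main__":
    # Illustrative numbers only: four half-hour intervals.
    mu_b, sigma_b = stored_energy_distribution([100.0] * 4, [400.0] * 4, 0.5)
    print(mu_b, sigma_b, prob_energy_at_least(160.0, mu_b, sigma_b))
```

A controller could call `prob_energy_at_least` for each candidate load plan to compare how likely each plan is to meet a stored-energy target.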
The base station controller usually has multiple objectives: provide high-quality communication service (low latency, high resolution), ensure a steady power supply, maintain sufficient stored energy, and keep a steady bus voltage, among other functions. Sometimes, these objectives conflict with each other. For example, by introducing some delay in data transmission, a communication base station could lower its energy consumption, resulting in a higher expected stored energy $\mu_B$ but worsening the quality of service (QoS). Thus, the base station controller needs an objective function to evaluate whether a choice of the control variable $\delta$ is optimal. Suppose the objective function is related to the power consumption and the stored energy distribution as shown in (11). Then, the goal of every base station controller is to find a control strategy $\delta(t)$ that maximizes the objective function (11). In , the control strategy is obtained by applying the indifference principle to the base stations. As mentioned before, this process relies on an accurate model of the microgrid, such as the generation $G_r$ and the load curve $P_{total}$. Additionally, the computation time of a multi-agent game increases exponentially as the number of agents increases. Thus, in the next section, reinforcement learning is introduced to find this control strategy.
III. Energy Management Game and Learning Algorithm
In the studied microgrid, the communication base station controllers are the agents. Each agent’s payoff depends not only on the actions it takes but also on the other agents’ moves. Moreover, due to the stochastic environment, the agents do not have stationary and deterministic policies. Therefore, a stochastic game is introduced to represent the non-stationary interactions of agents within the microgrid . Then, the Learning Automata method is introduced to solve for the optimal policy of each agent. These policies converge to an NE, which is a set of policies, one for each agent in the game, such that no agent can benefit by unilaterally deviating from its NE policy.
- Stochastic game
A stochastic game is an effective tool for modeling stochastic multi-agent systems [20, 21]. In this paper, an N-player Markov game is applied to model the stochastic environment and the behavior of all players. In this game, each player $i$ has a set of actions $S_i$, $1 \le i \le N$, which represents its choices for $\delta$ in (1); the number of actions is assumed to be $m_i$. During each play of the game, each of the players chooses an action. The result of each play is a random payoff to each player. Let $r_i = Obj(\delta_1, \ldots, \delta_N)$ denote the payoff of player $i$. The objective of each player is to maximize its expected payoff as shown in

$$u_i(\delta_1, \ldots, \delta_N) = E[r_i \mid \text{player } j \text{ chose action } \delta_j,\ \delta_j \in S_j,\ 1 \le j \le N] \quad (12)$$

The function (12) is the payoff or utility function of player $i$. A strategy or policy of a player can be defined as a probability vector $q_i = [q_{i1}, \ldots, q_{im_i}]$, where player $i$ chooses action $j$ with probability $q_{ij}$. The expected payoff could be extended to the set of all strategies as shown in

$$d_i(q_1, \ldots, q_N) = E[r_i \mid \text{player } j \text{ employs strategy } q_j,\ 1 \le j \le N] \quad (13)$$
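In this paper, the goal of the learning method is to adapt the agents’ strategies so that they converge to an NE. The condition for a strategy tuple $(q_1^0, \ldots, q_N^0)$ to be an NE in the stochastic game is that, for every player $i$, $1 \le i \le N$,

$$d_i(q_1^0, \ldots, q_{i-1}^0, q, q_{i+1}^0, \ldots, q_N^0) \le d_i(q_1^0, \ldots, q_N^0) \quad (14)$$

for every probability vector $q$ over the $m_i$ actions of player $i$. In general, each $q_i^0$ above is a mixed strategy, and every N-person game with finite action sets has at least one NE in mixed strategies . If all $q_i^0$ are unit probability vectors, then $(q_1^0, \ldots, q_N^0)$ is said to be a pure NE.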
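To make the expected payoff over mixed strategies and the pure NE condition concrete, the following Python sketch enumerates joint actions for a small game. The helper names and the toy coordination payoff are hypothetical and are not taken from the studied microgrid. Checking only pure unilateral deviations is sufficient here because a player's expected payoff is linear in its own strategy.

```python
from itertools import product


def expected_payoff(utility, strategies):
    """Expected payoff of one player when player j employs strategies[j].

    utility: maps a tuple of action indices to that player's payoff.
    strategies: list of probability vectors, one per player.
    """
    total = 0.0
    for joint in product(*[range(len(q)) for q in strategies]):
        prob = 1.0
        for q, action in zip(strategies, joint):
            prob *= q[action]
        total += prob * utility(joint)
    return total


def is_pure_nash(utilities, joint, n_actions):
    """True if no player gains by unilaterally switching its own action."""
    for i, utility in enumerate(utilities):
        base = utility(joint)
        for alt in range(n_actions[i]):
            deviated = joint[:i] + (alt,) + joint[i + 1:]
            if utility(deviated) > base:
                return False
    return True


if __name__ == "__main__":
    # Toy common-payoff game: both players are rewarded when actions match.
    def match(joint):
        return 1.0 if joint[0] == joint[1] else 0.0

    print(expected_payoff(match, [[0.5, 0.5], [0.5, 0.5]]))
    print(is_pure_nash([match, match], (0, 0), [2, 2]))
```

The enumeration grows exponentially with the number of players, which is consistent with the scaling concern raised for model-based game solutions.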
- Learning algorithm
A team of learning automata is applied to evolve toward an NE in the stochastic game. A learning automaton learns the optimal strategy through interactions with its environment and adapts its decision-making process in a trial-and-error manner. In the microgrid case, the automaton keeps a probability distribution over its available actions, and at each play of the game it chooses one action based on this distribution. The environment then sends back a random reward for this choice of action. The automaton uses this reward to update its action probability distribution using a learning algorithm, and the cycle repeats. This process is similar to policy iteration, but the updating law is different . The updating sequence of the learning automata is shown below:
- At time $t$, each player (automaton) $i$ chooses an action according to its action probability vector $q_i(t)$. Suppose the action taken is $\delta_i$.
- Each player obtains a payoff based on the set of all players’ actions. The reward of player $i$ is $r_i(t)$.
- Each player updates its action probability vector according to the rule

$$q_i(t+1) = q_i(t) + b \cdot r_i(t)\,(e_{\delta_i} - q_i(t)), \quad i = 1, \ldots, N \quad (15)$$

where $0 < b < 1$ is a learning rate parameter and $e_{\delta_i}$ is a unit vector whose $\delta_i$-th component is unity. This algorithm is also known as the Linear Reward-Inaction ($L_{R-I}$) algorithm. One important feature of this learning algorithm is that the convergence of the players’ strategies is guaranteed if the payoffs of all the players are the same, i.e.,

$$r_1(t) = r_2(t) = \cdots = r_N(t) \quad (16)$$

When (16) is satisfied, the learning algorithm drives the players towards a pure NE in the stochastic game .
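The three steps above can be sketched in Python as follows. This is an illustrative sketch rather than the simulation code used in this paper: the function names, the default learning rate, the number of steps, and the `payoff` callback are assumptions, and the common reward is assumed to be normalized to the interval [0, 1].

```python
import random


def choose_action(probs, rng):
    """Sample an action index according to the action probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def lri_update(probs, action, reward, b):
    """Linear reward-inaction step: q <- q + b * r * (e_action - q)."""
    return [q + b * reward * ((1.0 if j == action else 0.0) - q)
            for j, q in enumerate(probs)]


def train(payoff, n_players, n_actions, b=0.05, steps=5000, seed=0):
    """Run a team of automata that all receive the same payoff.

    payoff: hypothetical callback mapping the list of chosen action
    indices to a common reward in [0, 1].
    """
    rng = random.Random(seed)
    strategies = [[1.0 / n_actions] * n_actions for _ in range(n_players)]
    for _ in range(steps):
        actions = [choose_action(q, rng) for q in strategies]
        reward = payoff(actions)  # identical for every player
        strategies = [lri_update(q, a, reward, b)
                      for q, a in zip(strategies, actions)]
    return strategies


if __name__ == "__main__":
    # Toy payoff: the team is rewarded most when every agent picks action 1.
    def toy_payoff(actions):
        return 1.0 if all(a == 1 for a in actions) else 0.1

    for q in train(toy_payoff, n_players=3, n_actions=2):
        print([round(p, 3) for p in q])
```

Because a zero reward leaves the probability vector unchanged, the update only reinforces actions that were rewarded, which is the "inaction" part of the scheme. A smaller `b` slows learning but reduces the chance of locking onto a poor action early.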
IV. Evaluation of the Analysis
In this section, a communication microgrid with ten base stations is modeled. The microgrid is powered by a PV cell array and a battery unit. An objective function is applied to evaluate the optimality of a strategy:

$$r(\delta) = \begin{cases} \bar{\delta} + p(E_B(t_{end}) \ge E_B^{goal}(t_{end})), & p(E_B(t_{end}) \ge E_B^{goal}(t_{end})) \ge 0.5 \\ 1 - \bar{\delta}, & p(E_B(t_{end}) \ge E_B^{goal}(t_{end})) < 0.5 \end{cases} \quad (17)$$

where $E_B^{goal}(t_{end})$ is the desired stored energy at time $t_{end}$ set by the system operator, $E_B(t_{end})$ is the expected stored energy of the system at $t_{end}$, the probability $p(E_B(t_{end}) \ge E_B^{goal}(t_{end}))$ is computed using (1)-(10), and $\bar{\delta}$ is the averaged communication shaping factor, which represents the total communication quality. This objective function is not utilized by the agents during training because it requires an accurate prediction of the power generation. Instead, the agents are given a simplified reward function which only contains the SoC and the instantaneous $\delta_i$, as shown in

$$r(\delta) = \begin{cases} w_\delta \cdot \bar{\delta}(t) \cdot SoC(t) + w_{SoC} \cdot SoC(t), & SoC_{min} < SoC(t) < SoC_{goal} \\ \bar{\delta}(t), & SoC(t) \ge SoC_{goal} \\ 1 - \bar{\delta}(t), & SoC(t) \le SoC_{min} \end{cases} \quad (18)$$

where $w_\delta$ and $w_{SoC}$ are two weighting factors, $SoC_{goal}$ is the goal SoC level, and $SoC_{min}$ is the minimum allowed battery SoC. The design purpose of (18) is to encourage agents to choose a higher $\delta(t)$ when the SoC is high and vice versa. Since the average communication shaping factor $\bar{\delta}(t)$ is related to all players’ actions, it is assumed that the agents in this microgrid could share their choices of actions with each other.

Figure 2: Trained communication shaping factor strategy and SoC curve

Figure 3: Obtained distribution of actions for one agent after learning
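A minimal Python sketch of the simplified reward is given below. The function name and the default values of `soc_min`, `w_delta`, and `w_soc` are hypothetical placeholders, since the values used in the simulation are not stated here; the 80% goal follows the simulation setting. The sketch checks the minimum-SoC case first, which is an assumption about how overlapping conditions are resolved.

```python
def simplified_reward(delta_bar, soc, soc_goal=0.8, soc_min=0.2,
                      w_delta=0.5, w_soc=0.5):
    """Simplified reward based on the average shaping factor and the SoC.

    delta_bar: average communication shaping factor, assumed in [0, 1].
    soc: battery state of charge, assumed in [0, 1].
    soc_min, w_delta, w_soc: illustrative values, not from the paper.
    """
    if soc <= soc_min:
        # Battery nearly depleted: reward lower communication quality.
        return 1.0 - delta_bar
    if soc >= soc_goal:
        # Goal reached: reward higher communication quality.
        return delta_bar
    # In between: weight the shaping factor by the available SoC.
    return w_delta * delta_bar * soc + w_soc * soc


if __name__ == "__main__":
    for soc in (0.1, 0.5, 0.9):
        print(soc, simplified_reward(0.6, soc))
```

Because the reward depends only on the shared quantities `delta_bar` and `soc`, every agent that evaluates it receives the same value, which matches the common-payoff setting required by the learning algorithm.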
In the simulation, the stored energy goal $E_B^{goal}(t_{end})$ is set to reach 80% SoC at the end of operation. The average communication shaping factor and the corresponding battery SoC curve during a 50-day training are shown in Fig. 2. As the figure shows, the agents developed a load plan based on the SoC trends and reached a sufficient SoC. The action distribution of an agent at different SoC levels is shown in Fig. 3. As Fig. 3 shows, at each SoC level, the distribution vector of actions converges to a unit vector.
The cumulative objective score of the learning process evaluated using (17) is shown in Fig. 4. The total reward of the system converged to a steady value in 10 days, which is acceptable since most communication base stations require a tuning period before formal commissioning. The SoC of the system during the learning process is shown in Fig. 5: the stored energy is above 80% of the total capacity after 10 days and rose to the desired level sufficiently fast.
A comparison of the RL strategy with the one obtained by a global exhaustive search is shown in Fig. 6. During the exhaustive search, a central controller calculates (17) along the operating period and searches exhaustively for the $\bar{\delta}$ that maximizes the payoff. The comparison in Fig. 6 indicates that the strategy obtained by reinforcement learning is similar to the one obtained by the global exhaustive search.
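For illustration, an exhaustive search of this kind could be sketched in Python as follows. The piecewise objective follows the form given in Section IV, while `prob_fn`, the candidate grid, and the toy probability model are hypothetical stand-ins for the prediction obtained from (1)-(10).

```python
def objective(delta_bar, prob_goal_met):
    """Piecewise objective: reward quality only if the goal is likely met."""
    if prob_goal_met >= 0.5:
        return delta_bar + prob_goal_met
    return 1.0 - delta_bar


def exhaustive_search(candidates, prob_fn):
    """Return the candidate shaping factor with the highest objective.

    prob_fn: hypothetical callback mapping a shaping factor to the
    probability that the stored-energy goal is met.
    """
    best = max(candidates, key=lambda d: objective(d, prob_fn(d)))
    return best, objective(best, prob_fn(best))


if __name__ == "__main__":
    # Toy model: a higher shaping factor makes the energy goal less likely.
    grid = [0.0, 0.25, 0.5, 0.75, 1.0]
    print(exhaustive_search(grid, lambda d: 1.0 - d * d))
```

A finer grid improves the resolution of the result at the cost of more objective evaluations, each of which requires an accurate generation forecast.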
Figure 6: Comparison of the objective function between learning automata and exhaustive search
V. Conclusion
This paper proposed a learning mechanism for microgrids to optimize energy management. In this particular case, the analysis focuses on a system applicable to wireless communication networks, but the same approach can be used in other applications with a partially controllable load. The base stations in the microgrid are modeled as individual agents that apply a reinforcement learning algorithm to update their action strategies. Given the ability to share action choices, the algorithm is guaranteed to converge to a pure Nash equilibrium. The simulation results show that the energy management strategy obtained by the learning method is comparable to a globally optimal one. The main benefit of this learning method is that it requires no prior knowledge of the microgrid or of the operating environment. The learning nature of the method also makes it possible to adapt it to a varying environment, which is a possible path for future research.
Figure 4: Learning curve of an agent
Figure 5: SoC during the learning process
References
[1] M. Prihandrijanti, A. Malisie, and R. Otterpohl, “Cost–Benefit Analysis for Centralized and Decentralized Wastewater Treatment System (Case Study in Surabaya-Indonesia),” Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 259-268.
[2] B. Römer, P. Reichhart, J. Kranz, and A. Picot, “The role of smart metering and decentralized electricity storage for smart grids: The importance of positive externalities,” Energy Policy, vol. 50, pp. 486-495, 2012.
[3] Microgridmedia. (2018). Microgrid Projects: Global Microgrid Map from Microgrid Media. Available: http://microgridprojects.com/
[4] “Microgrids for disaster preparedness and recovery with electricity continuity plans and systems,” in Premium Official News, Plus Media Solutions, 2015.
[5] A. Kwasinski, “Lessons from Field Damage Assessments about Communication Networks Power Supply and Infrastructure Performance during Natural Disasters with a focus on Hurricane Sandy,” 2013.
[6] A. Mohsenian-Rad, V. W. S. Wong, J. Jatskevich, R. Schober, and A. Leon-Garcia, “Autonomous Demand-Side Management Based on Game-Theoretic Energy Consumption Scheduling for the Future Smart Grid,” IEEE Transactions on Smart Grid, vol. 1, no. 3, pp. 320-331, 2010.
[7] A. Kwasinski and A. Kwasinski, “Operational aspects and power architecture design for a microgrid to increase the use of renewable energy in wireless communication networks,” in 2014 International Power Electronics Conference (IPEC-Hiroshima 2014 – ECCE ASIA), 2014, pp. 2649-2655.
[8] A. Kwasinski and P. T. Krein, “Telecom power planning for natural and man-made disasters,” in INTELEC 07 – 29th International Telecommunications Energy Conference, 2007, pp. 216-222.
[9] R. Hu, A. Kwasinski, and A. Kwasinski, “Adaptive mixed strategy load management in dc microgrids for wireless communications systems,” in 2017 IEEE 3rd International Future Energy Electronics Conference and ECCE Asia (IFEEC 2017 – ECCE Asia), 2017, pp. 743-748.
[10] X. Liu, B. Gao, Z. Zhu, and Y. Tang, “Non-cooperative and cooperative optimisation of battery energy storage system for energy management in multi-microgrid,” IET Generation, Transmission and Distribution, vol. 12, no. 10, pp. 2369-2377, 2018.
[11] M. Wiering and M. van Otterlo, Reinforcement Learning: State-of-the-Art. Heidelberg; New York: Springer, 2012.
[12] M. Rahimiyan and H. R. Mashhadi, “An Adaptive Q-Learning Algorithm Developed for Agent-Based Computational Modeling of Electricity Market,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 40, no. 5, pp. 547-556, 2010.
[13] D. Li and S. K. Jayaweera, “Distributed Smart-Home Decision-Making in a Hierarchical Interactive Smart Grid Architecture,” IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 1, pp. 75-84, 2015.
[14] S. Vandael, B. Claessens, D. Ernst, T. Holvoet, and G. Deconinck, “Reinforcement Learning of Heuristic EV Fleet Charging in a Day-Ahead Electricity Market,” IEEE Transactions on Smart Grid, vol. 6, no. 4, pp. 1795-1805, 2015.
[15] R. Leo, R. S. Milton, and S. Sibi, “Reinforcement learning for optimal energy management of a solar microgrid,” in 2014 IEEE Global Humanitarian Technology Conference – South Asia Satellite (GHTC-SAS), 2014, pp. 183-188.
[16] L. Eller, L. C. Siafara, and T. Sauter, “Adaptive control for building energy management using reinforcement learning,” in 2018 IEEE International Conference on Industrial Technology (ICIT), 2018, pp. 1562-1567.
[17] Q. Sun, D. Wang, D. Ma, and B. Huang, “Multi-objective energy management for we-energy in Energy Internet using reinforcement learning,” in 2017 IEEE Symposium Series on Computational Intelligence (SSCI), 2017, pp. 1-6.
[18] K. Kiela, “Photovoltaic Cells,” Mokslas : Lietuvos Ateitis, vol. 4, no. 1, pp. 56-62, 2012.
[19] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ, USA, 1994.
[20] M. Maschler, E. Solan, and S. Zamir, Game Theory. Cambridge: Cambridge University Press, 2013.
[21] E. Solan and N. Vieille, “Stochastic games,” Proceedings of the National Academy of Sciences, vol. 112, no. 45, pp. 13743-13746, 2015.
[22] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[23] K. S. Narendra and M. A. L. Thathachar, Learning Automata: An Introduction. Prentice-Hall, Inc., 1989, p. 476.
[24] P. S. Sastry, V. V. Phansalkar, and M. A. L. Thathachar, “Decentralized learning of Nash equilibria in multi-person stochastic games with incomplete information,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 24, no. 5, pp. 769-777, 1994.