## Brief Research Summary

Reinforcement Learning (RL) is a powerful sampling-based technique for solving Markov Decision Processes (MDPs), i.e., problems in which an agent receives a reward that depends on the current situation (state) and the selected decision (action). The problem is non-trivial because the current action influences the future states and, in turn, the future rewards. RL has proven very successful, e.g., beating Chess and Go masters, both human and algorithmic.
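To make the problem statement concrete, the MDP objective can be written as below. This is a standard textbook sketch, not notation taken from this page: the symbols $s$ (state), $a$ (action), $\ell$ (stage cost), $\gamma$ (discount factor), and $\pi$ (policy) are assumptions, and the cost-minimization form common in control is used, which is equivalent to maximizing the reward $r = -\ell$.

```latex
\begin{aligned}
\pi^\star &= \arg\min_{\pi}\; \mathbb{E}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, \ell(s_k, a_k) \;\middle|\; a_k = \pi(s_k) \right], \\
Q^\star(s,a) &= \ell(s,a) + \gamma\, \mathbb{E}\!\left[\, \min_{a'} Q^\star(s_{+}, a') \;\middle|\; s, a \right].
\end{aligned}
```

The second line is the Bellman equation for the optimal action-value function: the expectation over the successor state $s_{+}$ is what couples the current action to all future costs.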

While very powerful, RL is not exempt from drawbacks and open issues. Among them, the difficulty of providing explainability and guaranteeing safety and stability is highly relevant, not only for the control community but also for applying RL in real-world mission-critical settings such as autonomous driving.

My research activity in this domain focuses on obtaining stability and safety guarantees as well as providing explainability of the RL solution. To that end, rather than relying on the popular Neural Networks (NNs) as function approximators, we propose to use structured function approximators such as Parametric Nonlinear Programs (PNLPs), and in particular Model Predictive Control (MPC).

#### Structured Function Approximators for RL

The key idea of our approach is that, by enforcing a specific function approximator structure, one is able to force RL to only explore parameter configurations that are provably safe. In other words, using MPC as a structured function approximator allows one to introduce rigorous safety and stability guarantees in RL. Moreover, as MPC provides a prediction of the future system evolution, this further favors explainability of RL.
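One way to picture MPC as a function approximator is sketched below. The symbols are assumptions made for illustration (parameter vector $\theta$, horizon $N$, stage cost $\ell_\theta$, initial cost $\lambda_\theta$, terminal cost $V^{\mathrm{f}}_\theta$, model $f_\theta$, constraints $h_\theta$, terminal set $\mathbb{X}^{\mathrm{f}}_\theta$); the exact formulation, including the robust constraint tightening, is given in the papers cited on this page.

```latex
\begin{aligned}
Q_\theta(s,a) \;=\; \min_{x,\,u}\quad & \lambda_\theta(x_0) + \gamma^{N} V^{\mathrm{f}}_\theta(x_N) + \sum_{k=0}^{N-1} \gamma^{k}\, \ell_\theta(x_k,u_k) \\
\text{s.t.}\quad & x_0 = s, \quad u_0 = a, \\
& x_{k+1} = f_\theta(x_k,u_k), \quad k = 0,\dots,N-1, \\
& h_\theta(x_k,u_k) \le 0, \quad k = 0,\dots,N-1, \\
& x_N \in \mathbb{X}^{\mathrm{f}}_\theta,
\end{aligned}
\qquad
V_\theta(s) = \min_{a} Q_\theta(s,a), \quad
\pi_\theta(s) = \arg\min_{a} Q_\theta(s,a).
```

In this view RL adjusts $\theta$ instead of neural network weights, and any guarantee that holds for the MPC scheme for all admissible $\theta$ is inherited by the learned policy $\pi_\theta$.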

##### A Simple Example: Safe and Stabilizing RL by Means of Robust MPC

In this example, safety is introduced by supporting the RL functions using robust MPC. RL then tunes the following MPC parameters: the Hessian, gradient, and constant of an initial cost (used for cost rotation in the context of economic MPC); the Hessian of a purely quadratic cost penalizing deviations from a steady-state reference; the steady-state reference itself; and the matrix describing the polyhedral uncertainty set, which has a fixed number of facets (4).

The RL objective is to push the system as close as possible to the steady state (-3,0) without violating the constraints, which require position, velocity, and acceleration to lie in the interval [-1,1]. The RL constraints require that the stage cost Hessian be positive definite, that the reference be a feasible steady state, that the uncertainty set include all observed noise samples, and that the terminal set be nonempty.
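Two of these RL constraints are easy to check numerically. The following is a minimal sketch under stated assumptions: the state and noise are taken to be two-dimensional, the uncertainty set is assumed to be represented as {w : M w ≤ 1} with one row of M per facet, and the numerical values and function names are hypothetical.

```python
def is_positive_definite_2x2(H, tol=1e-9):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    (a, b), (c, d) = H
    symmetric = abs(b - c) <= tol
    return symmetric and a > tol and a * d - b * c > tol


def contains_all_samples(M, samples, tol=1e-9):
    """True if every sample w satisfies M w <= 1 for each facet (row) of M."""
    return all(
        row[0] * w[0] + row[1] * w[1] <= 1.0 + tol
        for w in samples
        for row in M
    )


# Hypothetical values, chosen only for illustration.
H = [[2.0, 0.5], [0.5, 1.0]]            # candidate stage cost Hessian
M = [[10.0, 0.0], [-10.0, 0.0],         # 4 facets: the box |w_i| <= 0.1
     [0.0, 10.0], [0.0, -10.0]]
samples = [(0.05, -0.02), (-0.08, 0.09), (0.0, 0.1)]  # observed noise

feasible = is_positive_definite_2x2(H) and contains_all_samples(M, samples)
print(feasible)
```

A parameter update that makes either check fail would be rejected or corrected by the constrained RL step, so that only admissible parameters are ever applied to the system.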

In the left figure, the current state is displayed as a green circle, the past trajectory is displayed as a black line, the uncertainty tube is displayed in red, the minimum robust positive invariant (mRPI) set the system is guaranteed to converge to is displayed by the yellow border, the constraint set is displayed by the black border, and the terminal set (robust positive invariant) is displayed in cyan.

In the right figure, the true uncertainty set is displayed as the black border, the noise samples as black dots whose convex hull vertices are in red, and the uncertainty set selected by RL and used within MPC to guarantee robustness is displayed in cyan. This set remains essentially unchanged until the mRPI set is pushed close enough to the boundary of the constraint set, as until then, the closed-loop cost does not depend on the uncertainty set. Afterwards, the part of the uncertainty set which is most relevant to reduce the closed-loop cost is approximated well. In contrast, the top-right part of the set is not important for the control task and is not accurately learned.
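The convex hull vertices matter because a convex uncertainty set contains all noise samples if and only if it contains the hull vertices, so only those points need to be checked. The sketch below computes them with Andrew's monotone chain algorithm; it is an illustration with made-up sample points and a hypothetical function name, not the procedure used to produce the figure.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counterclockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:                      # build the lower chain
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):            # build the upper chain
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


# Made-up noise samples: four corners plus one interior point.
noise = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
hull = convex_hull(noise)
print(hull)
```

The interior point is discarded, which is why only a few samples (the red ones in the figure) actually constrain the learned uncertainty set.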

One RL update is taken every 20 time steps, in order to construct a minibatch of data to be used to compute the gradients in a constrained Q-learning approach.
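The batching pattern can be sketched on a toy problem as follows. This is a loose illustration, not the method of the cited papers: the Q-function here is linear in hand-picked features instead of being an MPC scheme, the scalar system, cost, and step size are invented, and the RL constraints are replaced by a simple clipping projection as a stand-in for the constrained update.

```python
import random

GAMMA = 0.9
ACTIONS = (-1.0, 0.0, 1.0)
BATCH = 20  # one RL update every 20 time steps, as in the text


def features(s, a):
    return (s * s, s * a, a * a, 1.0)


def q_value(theta, s, a):
    return sum(t * f for t, f in zip(theta, features(s, a)))


def greedy_value(theta, s):
    return min(q_value(theta, s, a) for a in ACTIONS)


def q_learning_step(theta, batch, alpha=0.01):
    """Average the semi-gradient TD update over the minibatch."""
    grad = [0.0] * len(theta)
    for s, a, cost, s_next in batch:
        delta = cost + GAMMA * greedy_value(theta, s_next) - q_value(theta, s, a)
        for i, f in enumerate(features(s, a)):
            grad[i] += delta * f / len(batch)
    return [t + alpha * g for t, g in zip(theta, grad)]


def project(theta, eps=1e-3):
    """Stand-in constraint: keep the s^2 and a^2 weights positive."""
    return [max(theta[0], eps), theta[1], max(theta[2], eps), theta[3]]


def run(steps=2000, seed=0):
    rng = random.Random(seed)
    theta, s, batch, updates = [1.0, 0.0, 1.0, 0.0], 0.0, [], 0
    for _ in range(steps):
        a = rng.choice(ACTIONS)  # purely exploratory behaviour policy
        s_next = 0.9 * s + 0.5 * a + rng.uniform(-0.1, 0.1)
        batch.append((s, a, s * s + 0.1 * a * a, s_next))
        s = s_next
        if len(batch) == BATCH:  # minibatch complete: take one RL update
            theta = project(q_learning_step(theta, batch))
            batch, updates = [], updates + 1
    return theta, updates


theta, updates = run()
print(updates, theta)
```

Collecting 20 transitions before each update averages out part of the noise in the temporal-difference gradient, at the price of updating the parameters less often.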

All details about the problem formulation can be found in:

S. Gros and M. Zanon. **Learning for MPC with Stability & Safety Guarantees**, Automatica, Vol. 146, 110598, ISSN 0005-1098, 2022.

M. Zanon and S. Gros. **Safe Reinforcement Learning Using Robust MPC**, IEEE Transactions on Automatic Control, Vol. 66, no. 8, pp. 3638-3652, 2021.

#### Publications

**Journal:**

- S. Gros and M. Zanon. **Learning for MPC with Stability & Safety Guarantees.** Automatica, 2022
- M. Zanon, S. Gros and M. Palladino. **Stability-Constrained Markov Decision Processes Using MPC.** Automatica, 2022
- M. Zanon and S. Gros. **Safe Reinforcement Learning Using Robust MPC.** IEEE Transactions on Automatic Control, 2021
- S. Gros and M. Zanon. **Data-driven Economic NMPC using Reinforcement Learning.** IEEE Transactions on Automatic Control, 2020

**Conference:**

- S. Menchetti, M. Zanon and A. Bemporad. **Linear Observer Learning by Temporal Difference.** Proceedings of the Conference on Decision and Control (CDC), 2022
- S. Gros and M. Zanon. **Reinforcement Learning based on MPC and the Stochastic Policy Gradient Method.** Proceedings of the American Control Conference (ACC), 2021
- S. Gros and M. Zanon. **Bias Correction in Reinforcement Learning via the Deterministic Policy Gradient Method for MPC-Based Policies.** Proceedings of the American Control Conference (ACC), 2021
- M. Zanon, V. Kungurtsev and S. Gros. **Reinforcement Learning Based on Real-Time Iteration NMPC.** Proceedings of the World Congress of the International Federation of Automatic Control, 2020
- S. Gros and M. Zanon. **Reinforcement Learning for Mixed-Integer Problems Based on MPC.** Proceedings of the World Congress of the International Federation of Automatic Control, 2020
- S. Gros, M. Zanon and A. Bemporad. **Safe Reinforcement Learning via Projection on a Safe Set: How to Achieve Optimality?** Proceedings of the World Congress of the International Federation of Automatic Control, 2020
- M. Zanon, S. Gros and A. Bemporad. **Practical Reinforcement Learning of Stabilizing Economic MPC.** Proceedings of the European Control Conference (ECC), 2019