In single-agent Markov decision processes, an agent can optimize its policy based on its interactions with the environment. In multi-player Markov games (MGs), the interaction is non-stationary due to the behaviors of other players. The core of our approach is to evolve one's policy according to not just its current in-game performance, but an aggregation of its performance over history. We show that for a variety of MGs, players in our learning scheme will provably converge to a point that is an approximation to the Nash equilibrium. Combined with neural networks, we develop an algorithm that is implemented in a reinforcement-learning framework and runs in a distributed way, with each player optimizing its policy based on its own observations. We use two numerical examples to validate the convergence property on small-scale MGs with $n\ge 2$ players.
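
The sketch below is not the paper's algorithm; it is a minimal illustration of the idea of updating each player's policy against performance aggregated over history rather than the latest interaction alone. The matching-pennies payoff matrix, the smoothed best response, and the specific averaging/step-size rules are illustrative assumptions, and only NumPy is required.

```python
# Minimal sketch (illustrative, not the paper's method): each player keeps a
# running average of the return obtained by each of its actions over the whole
# history, and softly moves its policy toward a smoothed best response to that
# aggregate. Matching pennies stands in for a small Markov game; each player
# updates using only its own actions and rewards.
import numpy as np

rng = np.random.default_rng(0)

# Payoff for player 0 in matching pennies; player 1 receives the negative.
PAYOFF = np.array([[+1.0, -1.0],
                   [-1.0, +1.0]])

n_actions = 2
policies = [np.full(n_actions, 0.5) for _ in range(2)]   # current mixed policies
avg_return = [np.zeros(n_actions) for _ in range(2)]     # historical action values
counts = [np.zeros(n_actions) for _ in range(2)]

def smoothed_best_response(values, temperature=0.1):
    """Softmax over aggregated action values (a smoothed best response)."""
    z = values / temperature
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

for t in range(1, 20001):
    # Each player samples from its own policy and observes only its own reward.
    a = [rng.choice(n_actions, p=policies[i]) for i in range(2)]
    rewards = [PAYOFF[a[0], a[1]], -PAYOFF[a[0], a[1]]]

    for i in range(2):
        # Aggregate performance over history via an incremental average.
        counts[i][a[i]] += 1
        avg_return[i][a[i]] += (rewards[i] - avg_return[i][a[i]]) / counts[i][a[i]]
        # Evolve the policy toward a smoothed best response to the aggregate
        # (convex combination with a decaying step size keeps it a distribution).
        target = smoothed_best_response(avg_return[i])
        policies[i] += (target - policies[i]) / t

# Both policies should drift toward roughly [0.5, 0.5], the Nash equilibrium.
print("Player 0 policy:", np.round(policies[0], 3))
print("Player 1 policy:", np.round(policies[1], 3))
```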

Author(s) : Yuanheng Zhu, Dongbin Zhao, Mengchen Zhao, Dong Li


Keywords : policy - markov - player - performance - players
