Reinforcement learning in complex environments may require supervision to prevent the agent from attempting dangerous actions. We present the Modified-Action Markov Decision Process, an extension of the MDP model that allows actions to differ from the policy. We analyze the asymptotic behaviours of common reinforcement learning algorithms in this setting and show that they adapt in different ways: some completely ignore modifications, while others go to various lengths in trying to avoid action modifications that decrease reward. By choosing the right algorithm, developers can prevent their agents from learning to circumvent interruptions or constraints, and better control agent responses to other kinds of action modification, like self-damage.
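The core idea of a Modified-Action MDP is that the action actually executed in the environment may differ from the one the policy selected. A minimal sketch of that setup, using a hypothetical toy chain environment and an illustrative "interruption" as the action-modification function (these names and dynamics are not from the paper):

```python
# Minimal sketch of a Modified-Action MDP (MAMDP) step.
# The chain environment and the interrupt rule are illustrative assumptions.

def chain_mdp_step(state, action):
    """Toy MDP: move left/right on a chain of states 0..4; reward at state 4."""
    next_state = max(0, min(4, state + (1 if action == "right" else -1)))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

def interrupt(state, action):
    """Action modification: in state 3 an overseer forces the agent left,
    blocking entry to the rewarding but 'dangerous' state 4."""
    return "left" if state == 3 else action

def mamdp_step(state, policy_action, modify):
    """MAMDP step: the executed action may differ from the policy's choice."""
    executed = modify(state, policy_action)
    next_state, reward = chain_mdp_step(state, executed)
    return executed, next_state, reward

# The policy picks "right", but in state 3 the executed action is modified.
executed, next_state, reward = mamdp_step(3, "right", interrupt)
print(executed, next_state, reward)  # -> left 2 0.0
```

How a learning algorithm accounts for the gap between the selected and executed action is exactly what determines whether it ignores the modification or learns to avoid it.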
Authors: Eric D. Langlois, Tom Everitt
Links: PDF - Abstract
Code :
https://github.com/mtrazzi/two-step-task
Keywords: action - actions - agents - modified - agent