Refined Policy Improvement Bounds for MDPs

07/16/2021

The policy improvement bound on the difference of the discounted returns plays a crucial role in the theoretical justification of the trust-region policy optimization (TRPO) algorithm. The existing bound becomes degenerate as the discount factor approaches one, which calls into question the applicability of TRPO and related algorithms in that regime. We refine the results in <cit.> and propose a novel bound that is "continuous" in the discount factor. In particular, our bound also applies to MDPs with long-run average rewards.
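
For context, the degeneracy can be sketched using the form of the bound stated in the original TRPO paper. This is an assumption about which bound is meant: the abstract does not reproduce it, and the symbols below follow that paper's notation rather than anything defined here. Here \eta is the expected discounted return, L_\pi the local surrogate objective, A_\pi the advantage function, and \gamma the discount factor.

```latex
\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) \;-\; C \, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),
\qquad
C = \frac{4 \epsilon \gamma}{(1-\gamma)^{2}},
\qquad
\epsilon = \max_{s,a} \lvert A_{\pi}(s,a) \rvert .
```

For a fixed \epsilon, the penalty coefficient C grows like (1-\gamma)^{-2}. Even after the return is normalized by the factor (1-\gamma), the penalty term still grows like (1-\gamma)^{-1} as \gamma \to 1, so a bound of this form carries no information in the limit. This is the sense in which a bound that is "continuous" in the discount factor is needed.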
