Policy Mirror Descent Inherently Explores Action Space

03/08/2023
by Yan Li, et al.

Designing computationally efficient exploration strategies for on-policy first-order methods that attain the optimal 𝒪(1/ϵ^2) sample complexity remains an open problem for solving Markov decision processes (MDPs). This manuscript answers this question from a perspective of simplicity, by showing that whenever exploration over the state space is implied by the MDP structure, there appears to be little need for sophisticated exploration strategies. We revisit a stochastic policy gradient method, named stochastic policy mirror descent (SPMD), applied to the infinite-horizon, discounted MDP with finite state and action spaces. Accompanying SPMD, we present two on-policy evaluation operators, both of which simply follow the current policy for trajectory collection, with no explicit exploration or any other form of intervention. SPMD with the first evaluation operator, named value-based estimation, is tailored to the Kullback-Leibler (KL) divergence. Provided the Markov chains on the state space induced by the generated policies are uniformly mixing with a non-diminishing minimal visitation measure, an 𝒪̃(1/ϵ^2) sample complexity is obtained with a linear dependence on the size of the action space. SPMD with the second evaluation operator, named truncated on-policy Monte Carlo, attains an 𝒪̃(ℋ_𝒟/ϵ^2) sample complexity under the same assumption on the state chains of the generated policies. We characterize ℋ_𝒟 as a divergence-dependent function of the effective horizon and the size of the action space, which leads to an exponential dependence on the latter two quantities for the KL divergence, and a polynomial dependence for the divergence induced by the negative Tsallis entropy. The obtained sample complexities appear to be new among on-policy stochastic policy gradient methods without explicit exploration.
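For intuition, the KL-divergence instance of a policy mirror descent step has a well-known closed form: an exponentiated (multiplicative-weights) adjustment of the current policy by the estimated action values, followed by per-state normalization. The sketch below illustrates one such update in plain NumPy under stated assumptions; the function name, the reward-maximization sign convention, and the tabular representation are illustrative choices, not the authors' implementation, and the evaluation step producing the Q-value estimates (value-based estimation or truncated on-policy Monte Carlo) is abstracted away.

```python
import numpy as np

def spmd_kl_update(policy, q_estimates, stepsize):
    """One KL-divergence mirror descent step (illustrative sketch, not the paper's code).

    policy:      array of shape (num_states, num_actions); rows are strictly
                 positive and sum to 1.
    q_estimates: estimated action values Q^pi(s, a), same shape, assumed to come
                 from an on-policy evaluation step.
    stepsize:    positive step size for this iteration.
    """
    # Exponentiated-gradient step: log pi_k + eta * Q_hat, with a per-state
    # max subtraction for numerical stability before exponentiating.
    logits = np.log(policy) + stepsize * q_estimates
    logits -= logits.max(axis=1, keepdims=True)
    new_policy = np.exp(logits)
    # Renormalize each state's action distribution.
    new_policy /= new_policy.sum(axis=1, keepdims=True)
    return new_policy

# Hypothetical usage on a 3-state, 2-action tabular MDP.
pi = np.full((3, 2), 0.5)
q_hat = np.random.rand(3, 2)  # stand-in for an on-policy Q-value estimate
pi = spmd_kl_update(pi, q_hat, stepsize=0.1)
```

Note that this multiplicative update keeps every action's probability strictly positive, which is consistent with the paper's viewpoint that the mirror descent step itself supplies exploration over the action space.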
