Towards Tight Bounds on the Sample Complexity of Average-reward MDPs

06/13/2021
by Yujia Jin, et al.

We prove new upper and lower bounds on the sample complexity of finding an ϵ-optimal policy of an infinite-horizon average-reward Markov decision process (MDP) given access to a generative model. When the mixing time of the probability transition matrix of every policy is at most t_mix, we provide an algorithm that solves the problem using Õ(t_mix ϵ^{-3}) (oblivious) samples per state-action pair. Further, we provide a lower bound showing that a linear dependence on t_mix is necessary in the worst case for any algorithm that uses oblivious samples. We obtain our results by establishing connections between infinite-horizon average-reward MDPs and discounted MDPs that may be of further utility.

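The upper bound can be sanity-checked against the average-reward-to-discounted reduction the abstract alludes to. Below is a back-of-the-envelope LaTeX sketch, not the paper's exact construction: it assumes the standard Õ((1-γ)^{-3} δ^{-2}) per-state-action sample complexity for computing a δ-optimal policy in a γ-discounted MDP, and the illustrative discount choice γ = 1 - ϵ/t_mix.

\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Heuristic arithmetic only; the discount choice and solver complexity below
% are assumptions for illustration, not taken from the abstract.
Choose the discount factor so the effective horizon is $t_{\mathrm{mix}}/\epsilon$:
\[
  \gamma = 1 - \frac{\epsilon}{t_{\mathrm{mix}}},
  \qquad
  \frac{1}{1-\gamma} = \frac{t_{\mathrm{mix}}}{\epsilon}.
\]
Since discounted values scale as $v^{\pi}_{\gamma} \approx \rho^{\pi}/(1-\gamma)$,
an $\epsilon$-optimal average reward corresponds to discounted accuracy
$\delta = \epsilon/(1-\gamma) = t_{\mathrm{mix}}$. Plugging into the assumed
discounted-MDP sample complexity gives, per state-action pair,
\[
  \widetilde{O}\!\left(\frac{1}{(1-\gamma)^{3}\delta^{2}}\right)
  = \widetilde{O}\!\left(\frac{t_{\mathrm{mix}}^{3}}{\epsilon^{3}}
      \cdot \frac{1}{t_{\mathrm{mix}}^{2}}\right)
  = \widetilde{O}\!\left(\frac{t_{\mathrm{mix}}}{\epsilon^{3}}\right),
\]
matching the stated upper bound.
\end{document}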