Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards

04/28/2023
by Hao Qin, et al.

We study K-armed bandit problems where the reward distributions of all arms are supported on the [0,1] interval. Designing regret-efficient randomized exploration algorithms in this setting has been a challenge. Maillard sampling <cit.>, an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting <cit.> while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we propose the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling that achieves a KL-style gap-dependent regret bound. We show that KL-MS is asymptotically optimal when the rewards are Bernoulli and has a worst-case regret bound of the form O(√(μ^*(1-μ^*) K T ln K) + K ln T), where μ^* is the expected reward of the optimal arm and T is the time horizon.
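The closed-form action probabilities mentioned above can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it assumes the KL-MS sampling distribution weights each arm a by exp(-N_a · kl(μ̂_a, μ̂_max)), where N_a is the pull count, μ̂_a the empirical mean, and kl the binary KL divergence; the function names `binary_kl` and `kl_ms_probs` are illustrative.

```python
import math

def binary_kl(p: float, q: float) -> float:
    """Binary KL divergence kl(p, q), with clipping to avoid log(0)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ms_probs(means, counts):
    """Hypothetical closed-form action probabilities for a KL-style
    Maillard sampling rule: p(a) proportional to exp(-N_a * kl(mu_a, mu_max)).
    The empirically best arm gets weight exp(0) = 1, so it is always
    the most likely arm to be pulled."""
    mu_max = max(means)
    weights = [math.exp(-n * binary_kl(m, mu_max))
               for m, n in zip(means, counts)]
    total = sum(weights)
    return [w / total for w in weights]
```

Because the probabilities are in closed form, a logging policy built this way can record p(a) at each step, which is what makes offline policy evaluation straightforward compared to Thompson sampling, whose action probabilities are only defined implicitly through posterior sampling.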
