Thompson Sampling for Stochastic Control: The Finite Parameter Case

Kim MJ

IEEE Transactions on Automatic Control, Vol.62, No.12, 6415-6422, 2017

DOI: 10.1109/TAC.2017.2653942

In this paper, we apply Thompson sampling to a class of average-reward stochastic control problems with parameter uncertainty. Specifically, we study an infinite-horizon average-reward stochastic control problem in which both the reward and state transition distributions are parameterized by an unknown parameter taking values in a finite space. The main result of this paper is a proof that Thompson sampling achieves a worst-case average per-period regret of O(T^{-1}), which is asymptotically optimal.
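The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general Thompson sampling loop over a finite parameter set: sample a parameter from the posterior, act with the policy that is optimal for the sampled parameter, observe the transition, and apply Bayes' rule. The two-state model, the table `P_NEXT1`, the precomputed `POLICIES`, and all function names are hypothetical illustrations and are not taken from the paper.

```python
import random

# Hypothetical toy model (an assumption, not from the paper): states {0, 1},
# actions {0, 1}, finite parameter set {0, 1}. P_NEXT1[theta][action] is the
# probability of moving to state 1; the reward equals the next state.
P_NEXT1 = [
    [0.8, 0.2],  # theta = 0: action 0 is the better action
    [0.2, 0.8],  # theta = 1: action 1 is the better action
]

# Optimal stationary policy for each candidate parameter, stored as
# POLICIES[theta][state] -> action. In this toy model it ignores the state.
POLICIES = [
    [0, 0],
    [1, 1],
]


def likelihood(theta, action, next_state):
    """Probability of the observed transition under candidate parameter theta."""
    p = P_NEXT1[theta][action]
    return p if next_state == 1 else 1.0 - p


def posterior_update(posterior, action, next_state):
    """Bayes' rule over the finite parameter set."""
    unnorm = [w * likelihood(th, action, next_state)
              for th, w in enumerate(posterior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def run_thompson_sampling(true_theta, horizon, seed=0):
    """Simulate the loop; return the final posterior and the average reward."""
    rng = random.Random(seed)
    posterior = [0.5, 0.5]  # uniform prior over the two candidates
    state = 0
    total_reward = 0
    for _ in range(horizon):
        # 1. Sample a parameter from the current posterior.
        sampled = rng.choices(range(len(posterior)), weights=posterior)[0]
        # 2. Act as if the sampled parameter were the true one.
        action = POLICIES[sampled][state]
        # 3. The environment moves according to the true parameter.
        next_state = 1 if rng.random() < P_NEXT1[true_theta][action] else 0
        total_reward += next_state
        # 4. Update the posterior with the observed transition.
        posterior = posterior_update(posterior, action, next_state)
        state = next_state
    return posterior, total_reward / horizon
```

Calling `run_thompson_sampling(true_theta=1, horizon=200)` returns the final posterior and the empirical average reward. In this toy model every action is informative about the parameter, so the posterior concentrates quickly. The conditions under which the paper establishes its regret bound are not reproduced here.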

Keywords: Average regret bounds; Bayesian learning; posterior convergence rate; Thompson sampling