Batched Learning in Generalized Linear Contextual Bandits with General Decision Sets
Abstract
In real-world adaptive personalized decision making, due to physical and/or resource constraints, a decision maker often does not have the luxury of immediately incorporating the feedback from the previous individual into forming new policies for future individuals. This is an important aspect that has largely been abstracted away from the traditional online learning/decision making literature. In this paper, we study the problem of batched learning in generalized linear contextual bandits where the decision maker, unlike in traditional online learning, can only access feedback at the end of a limited number of batches, and when selecting actions within a batch, can only use information from prior batches. We provide a lower bound that characterizes the fundamental limit of performance in this setting and then give a DCB-based batched learning algorithm whose regret bound, obtained using a self-normalized martingale style analysis, nearly matches this lower bound. Our results provide a novel inquiry into generalized linear contextual bandits with arbitrary action sets, which include several bandits setting as special cases and thus shed light on batch-constrained decision making in general.