Gradient-free online learning in games with delayed rewards
Am lie H liou, Panayotis Mertikopoulos, et al.
ICML 2020
In real-world adaptive personalized decision making, due to physical and/or resource constraints, a decision maker often does not have the luxury of immediately incorporating the feedback from the previous individual into forming new policies for future individuals. This is an important aspect that has largely been abstracted away from the traditional online learning/decision making literature. In this letter, we study the problem of batched learning in generalized linear contextual bandits where the decision maker, unlike in traditional online learning, can only access feedback at the end of a limited number of batches, and when selecting actions within a batch, can only use information from prior batches. We provide a lower bound that characterizes the fundamental limit of performance in this setting and then give a UCB-based batched learning algorithm whose regret bound, obtained using a self-normalized martingale style analysis, nearly matches this lower bound. Our results provide a novel inquiry into generalized linear contextual bandits with arbitrary action sets, which include several bandits setting as special cases and thus shed light on batch-constrained decision making in general.
Am lie H liou, Panayotis Mertikopoulos, et al.
ICML 2020
Zhaolin Ren, Zhengyuan Zhou, et al.
AAAI 2020
Nian Si, Fan Zhang, et al.
ICML 2020
Tsuyoshi Idé, Ankush Khandelwal, et al.
ICDM 2016