Exploration in Reward Machines with Low Regret
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
Exploration in Reward Machines with Low Regret. / Bourel, Hippolyte; Jonsson, Anders; Maillard, Odalric Ambrym; Talebi, Mohammad Sadegh.
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics. Vol. 206. PMLR, 2023. p. 4114-4146 (Proceedings of Machine Learning Research, Vol. 206).
Bibtex
@inproceedings{bourel2023exploration,
  title     = {Exploration in Reward Machines with Low Regret},
  author    = {Bourel, Hippolyte and Jonsson, Anders and Maillard, Odalric Ambrym and Talebi, Mohammad Sadegh},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  series    = {Proceedings of Machine Learning Research},
  volume    = {206},
  pages     = {4114--4146},
  publisher = {PMLR},
  year      = {2023},
}
RIS
TY - GEN
T1 - Exploration in Reward Machines with Low Regret
AU - Bourel, Hippolyte
AU - Jonsson, Anders
AU - Maillard, Odalric Ambrym
AU - Talebi, Mohammad Sadegh
N1 - Publisher Copyright: Copyright © 2023 by the author(s)
PY - 2023
Y1 - 2023
N2 - We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge in the form of reward machines is available to the learner. Specifically, we investigate the efficiency of RL under the average-reward criterion, in the regret minimization setting. We propose two model-based RL algorithms that each exploit the structure of the reward machines, and show that our algorithms achieve regret bounds that improve over those of baselines by a multiplicative factor proportional to the number of states in the underlying reward machine. To the best of our knowledge, the proposed algorithms and associated regret bounds are the first to tailor the analysis specifically to reward machines, in either the episodic or average-reward setting. We also present a regret lower bound for the studied setting, which indicates that the proposed algorithms achieve near-optimal regret. Finally, we report numerical experiments that demonstrate the superiority of the proposed algorithms over existing baselines in practice.
AB - We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge in the form of reward machines is available to the learner. Specifically, we investigate the efficiency of RL under the average-reward criterion, in the regret minimization setting. We propose two model-based RL algorithms that each exploit the structure of the reward machines, and show that our algorithms achieve regret bounds that improve over those of baselines by a multiplicative factor proportional to the number of states in the underlying reward machine. To the best of our knowledge, the proposed algorithms and associated regret bounds are the first to tailor the analysis specifically to reward machines, in either the episodic or average-reward setting. We also present a regret lower bound for the studied setting, which indicates that the proposed algorithms achieve near-optimal regret. Finally, we report numerical experiments that demonstrate the superiority of the proposed algorithms over existing baselines in practice.
UR - http://www.scopus.com/inward/record.url?scp=85165183137&partnerID=8YFLogxK
M3 - Article in proceedings
AN - SCOPUS:85165183137
VL - 206
T3 - Proceedings of Machine Learning Research
SP - 4114
EP - 4146
BT - Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
PB - PMLR
T2 - 26th International Conference on Artificial Intelligence and Statistics, AISTATS 2023
Y2 - 25 April 2023 through 27 April 2023
ER -
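For readers unfamiliar with the formalism the abstract refers to, below is a minimal sketch of a reward machine, assuming the standard definition from the reward-machine literature: a finite automaton over high-level event labels whose transitions emit rewards, making a non-Markovian reward Markovian on the product of environment and machine states. The class, the example machine, and all identifiers are illustrative and not taken from the paper.

    # Minimal illustrative sketch of a reward machine (standard formalism;
    # example and names are illustrative, not from the paper).
    from typing import Dict, Set, Tuple

    class RewardMachine:
        def __init__(self,
                     states: Set[str],
                     initial: str,
                     # (rm_state, event_label) -> (next_rm_state, reward)
                     transitions: Dict[Tuple[str, str], Tuple[str, float]]):
            self.states = states
            self.initial = initial
            self.transitions = transitions
            self.current = initial

        def step(self, label: str) -> float:
            """Advance the machine on an observed event label; return the reward."""
            # Unlisted labels self-loop with zero reward.
            next_state, reward = self.transitions.get(
                (self.current, label), (self.current, 0.0))
            self.current = next_state
            return reward

    # Example: reward 1 only for visiting A and then B. The task is
    # non-Markovian for the environment alone (the reward at B depends on
    # whether A was visited earlier), but Markovian given the machine state.
    rm = RewardMachine(
        states={"u0", "u1"},
        initial="u0",
        transitions={
            ("u0", "A"): ("u1", 0.0),   # remember that A has been visited
            ("u1", "B"): ("u0", 1.0),   # reward for B after A, then reset
        },
    )

    for label in ["B", "A", "B"]:
        print(label, rm.step(label))   # prints: B 0.0, A 0.0, B 1.0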