Abstract
This paper presents a theoretical and empirical analysis of Expected Sarsa, a variation on Sarsa, the classic on-policy
temporal-difference method for model-free reinforcement
learning. Expected Sarsa exploits knowledge about stochasticity
in the behavior policy to perform updates with lower variance.
Doing so allows for higher learning rates and thus faster learning.
In deterministic environments, Expected Sarsa’s updates
have zero variance, enabling a learning rate of 1. We prove
that Expected Sarsa converges under the same conditions as
Sarsa and formulate specific hypotheses about when Expected
Sarsa will outperform Sarsa and Q-learning. Experiments in
multiple domains confirm these hypotheses and demonstrate
that Expected Sarsa has significant advantages over these more
commonly used methods.
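
The key idea described in the abstract is that Expected Sarsa replaces the sampled next-action value in the Sarsa target with its expectation under the behavior policy. Below is a minimal tabular sketch of that difference, assuming an ε-greedy behavior policy and a NumPy Q-table; the function names (`sarsa_update`, `expected_sarsa_update`, `epsilon_greedy_probs`) are illustrative and not taken from the paper.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state's Q-values."""
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Classic Sarsa: the target uses the single sampled next action a_next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, epsilon):
    """Expected Sarsa: the target averages over all next actions, weighted by
    the behavior policy's probabilities, removing the variance that comes
    from sampling a_next."""
    probs = epsilon_greedy_probs(Q[s_next], epsilon)
    target = r + gamma * np.dot(probs, Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```

Because the expected target has no sampling noise over the next action, in a deterministic environment the update itself has zero variance, which is why the abstract notes that a learning rate of 1 becomes usable.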
| Original language | English |
| --- | --- |
| Title of host publication | Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning |
| Subtitle of host publication | ADPRL |
| Publication status | Published - 2009 |
Keywords
- Reinforcement learning