Uses Markov decision processes (MDPs) and Temporal Difference (TD) Q-learning to maximize expected cumulative reward in a "grid world".
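The following is a minimal sketch of tabular TD Q-learning on a grid world. It is not the repository's code. The 4x4 layout, start (0, 0), goal (3, 3), the -1 step cost and +10 goal reward, and the hyperparameters alpha, gamma and epsilon are all hypothetical choices made for illustration. Each step applies the TD update Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)) while acting epsilon-greedily.

```python
import random

# Hypothetical grid world (illustrative, not from the repository):
# 4x4 cells, start at (0, 0), terminal goal at (3, 3).
ROWS, COLS = 4, 4
START, GOAL = (0, 0), (3, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Deterministic MDP transition: move if the target cell is on the grid, else stay put.

    Returns (next_state, reward, done): -1 per move, +10 on reaching the goal.
    """
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < ROWS and 0 <= c < COLS):
        r, c = state
    next_state = (r, c)
    if next_state == GOAL:
        return next_state, 10.0, True
    return next_state, -1.0, False


def greedy_action(q, state):
    """Index of the action with the highest Q-value (ties go to the lowest index)."""
    return max(range(len(ACTIONS)), key=lambda i: q[(state, i)])


def q_learning(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = {((r, c), a): 0.0
         for r in range(ROWS) for c in range(COLS) for a in range(len(ACTIONS))}
    for _ in range(episodes):
        state, done = START, False
        while not done:
            # Explore with probability epsilon, otherwise exploit current estimates.
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = greedy_action(q, state)
            next_state, reward, done = step(state, ACTIONS[a])
            # TD target: terminal states contribute no future value.
            future = 0.0 if done else max(q[(next_state, i)] for i in range(len(ACTIONS)))
            td_target = reward + gamma * future
            # Move the estimate a fraction alpha toward the TD target.
            q[(state, a)] += alpha * (td_target - q[(state, a)])
            state = next_state
    return q


def greedy_path(q, max_steps=20):
    """Follow the learned greedy policy from START, capped at max_steps moves."""
    state, path = START, [START]
    for _ in range(max_steps):
        if state == GOAL:
            break
        state, _, _ = step(state, ACTIONS[greedy_action(q, state)])
        path.append(state)
    return path
```

Calling greedy_path(q_learning()) should return a route from START that ends at GOAL. Because every move costs -1 and gamma < 1 discounts the goal reward, the learned greedy policy is pushed toward a shortest path.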