Exercise 3.8 of Reinforcement Learning: An Introduction (2nd Edition) reads:
Suppose $\gamma = 0.5$ and the following sequence of rewards is received $R_1 = -1$, $R_2 = 2$, $R_3 = 6$, $R_4 = 3$, and $R_5 = 2$, with $T = 5$. What are $G_0, G_1, \ldots, G_5$? Hint: Work backwards.
An answer circulating online, attributed to Sutton himself (unverified), reads as follows:
Without a doubt, $G_5 = G_T = 0$. But from the recursive relation between return and reward:

$$G_t = R_{t+1} + \gamma G_{t+1}$$
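it follows easily that:

$$G_4 = R_5 + \gamma G_5 = 2 + 0.5 \times 0 = 2$$

Similarly, $G_3 = 4$, $G_2 = 8$, $G_1 = 6$, and $G_0 = 2$, which is very different from the answer attributed to Sutton!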
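To double-check the backward recursion by machine, here is a minimal Python sketch. It assumes the exercise data as printed in the book ($\gamma = 0.5$, $R_1, \ldots, R_5 = -1, 2, 6, 3, 2$, $T = 5$) and the convention $G_T = 0$; the function name `returns_backward` is my own and purely illustrative.

```python
def returns_backward(rewards, gamma):
    """Compute [G_0, ..., G_T] via G_t = R_{t+1} + gamma * G_{t+1}.

    rewards[k] holds R_{k+1}, so len(rewards) == T.
    """
    T = len(rewards)
    G = [0.0] * (T + 1)  # G[T] = 0: no reward follows the terminal step
    for t in range(T - 1, -1, -1):  # work backwards from t = T-1 to t = 0
        G[t] = rewards[t] + gamma * G[t + 1]
    return G


# Assumed exercise data: R_1..R_5 and gamma = 0.5
rewards = [-1, 2, 6, 3, 2]
print(returns_backward(rewards, 0.5))  # [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
```

The list is indexed by time step, so the first entry is $G_0$ and the last is $G_5$.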
Using the forward formula for the return:

$$G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$$
one arrives at the same result.
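The forward computation can be sketched the same way. This assumes the finite-horizon sum $G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$ and the same exercise data ($\gamma = 0.5$, rewards $-1, 2, 6, 3, 2$); `returns_forward` is again a hypothetical helper name, not something from the book.

```python
def returns_forward(rewards, gamma):
    """Compute [G_0, ..., G_T] directly from the discounted sum.

    rewards[k - 1] holds R_k for k = 1..T.
    """
    T = len(rewards)
    return [
        # Sum over k = t+1..T; the sum is empty (0.0) when t == T.
        sum((gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, T + 1)), 0.0)
        for t in range(T + 1)
    ]


# Assumed exercise data: R_1..R_5 and gamma = 0.5
rewards = [-1, 2, 6, 3, 2]
print(returns_forward(rewards, 0.5))  # [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
```

For example, $G_0 = -1 + 0.5 \cdot 2 + 0.25 \cdot 6 + 0.125 \cdot 3 + 0.0625 \cdot 2 = 2$, matching the value obtained by working backwards.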
Have I misunderstood something, or is the answer attributed to Sutton wrong? I would appreciate any pointers from fellow readers. Thanks!