Exercise 3.8 of "Reinforcement Learning: An Introduction" (2nd Edition) reads:

Suppose $\gamma = 0.5$ and the following sequence of rewards is received $R_1 = -1$, $R_2 = 2$, $R_3 = 6$, $R_4 = 3$, and $R_5 = 2$, with $T = 5$. What are $G_0, G_1, \ldots, G_5$? Hint: Work backwards.

An answer circulating online, said to come from Sutton himself (unverified), reads:

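Without question, $G_5 = 0$, since $T = 5$ is the terminal step. But by the recursive relation between return and reward:

$$G_t = R_{t+1} + \gamma G_{t+1}$$

it is easy to obtain:

$$G_4 = R_5 + \gamma G_5 = 2 + 0.5 \times 0 = 2$$

Similarly, $G_3 = 4$, $G_2 = 8$, $G_1 = 6$, and $G_0 = 2$, which differs far too much from the answer attributed to Sutton!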
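As a sanity check on the backward recursion, here is a minimal Python sketch. The function name `returns_backward` is my own, and the reward values and discount factor are those of Exercise 3.8 as printed in the book, so treat it as an illustration rather than an official solution.

```python
def returns_backward(rewards, gamma):
    """Return [G_0, ..., G_T] given rewards = [R_1, ..., R_T]."""
    T = len(rewards)
    G = [0.0] * (T + 1)  # G[T] = 0: no rewards follow the terminal step
    for t in range(T - 1, -1, -1):
        # rewards[t] holds R_{t+1} because the Python list is 0-indexed
        G[t] = rewards[t] + gamma * G[t + 1]
    return G


rewards = [-1, 2, 6, 3, 2]  # R_1..R_5 from Exercise 3.8
print(returns_backward(rewards, 0.5))  # [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
```

The list is ordered from $G_0$ to $G_5$, so the last entry is the terminal return.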

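Using the forward formula for the return:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$$

I reach the same conclusion.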
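The forward sum can be checked the same way. This is a small sketch under the same assumptions: the helper name `return_forward` is hypothetical, and the rewards and discount factor are taken from the exercise as printed in the book.

```python
def return_forward(rewards, gamma, t):
    """G_t = sum over k = t+1..T of gamma^(k-t-1) * R_k, with rewards = [R_1, ..., R_T]."""
    T = len(rewards)
    # rewards[k - 1] holds R_k; the 0.0 start value keeps the empty sum (t = T) a float
    return sum((gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, T + 1)), 0.0)


rewards = [-1, 2, 6, 3, 2]  # R_1..R_5 from Exercise 3.8
print([return_forward(rewards, 0.5, t) for t in range(len(rewards) + 1)])
# [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
```

Each $G_t$ is computed independently here, so agreement with the recursive values is a genuine cross-check rather than the same calculation repeated.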

Have I misunderstood something, or is the answer attributed to Sutton wrong? I would appreciate any pointers from fellow readers. Thanks!