Exercise 3.8 of "Reinforcement Learning: An Introduction" (2nd Edition) reads as follows:
Suppose $\gamma = 0.5$ and the following sequence of rewards is received: $R_1 = -1$, $R_2 = 2$, $R_3 = 6$, $R_4 = 3$, and $R_5 = 2$, with $T = 5$. What are $G_0, G_1, \ldots, G_5$? Hint: Work backwards.
An answer circulating online, attributed to Sutton himself (unverified), gives:
$G_0 = 2$, $G_1 = 3$, $G_2 = 2$, $G_3 = 12$, $G_4 = 18$, $G_5 = 0$
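There is no doubt that $G_5 = 0$. But from the recurrence relating the return to the reward,
$G_t = R_{t+1} + \gamma G_{t+1}$,
it is easy to obtain
$G_4 = R_5 + \gamma G_5 = 2 + 0 = 2$,
and in the same way $G_3 = 3 + 0.5 \cdot 2 = 4$, $G_2 = 6 + 0.5 \cdot 4 = 8$, $G_1 = 2 + 0.5 \cdot 8 = 6$, $G_0 = -1 + 0.5 \cdot 6 = 2$. Apart from $G_0$ and $G_5$, this disagrees completely with the answer attributed to Sutton!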
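To make the backward recursion easy to re-check, here is a minimal Python sketch. The function name `returns_backward` and the list layout (`rewards[k]` holds $R_{k+1}$) are my own choices for illustration, and the reward list assumes $R_1 = -1$ as in the book's statement of the exercise, which is also what $G_0 = 2$ requires.

```python
def returns_backward(rewards, gamma):
    """Compute [G_0, ..., G_T] via G_t = R_{t+1} + gamma * G_{t+1}.

    rewards[k] is assumed to hold R_{k+1}, so len(rewards) == T.
    """
    T = len(rewards)
    G = [0.0] * (T + 1)          # G_T = 0: nothing is received after the terminal step
    for t in range(T - 1, -1, -1):
        G[t] = rewards[t] + gamma * G[t + 1]
    return G


# Rewards R_1..R_5 from the exercise (R_1 = -1 is an assumption, see above).
print(returns_backward([-1, 2, 6, 3, 2], 0.5))
# -> [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
```

The printed list is $G_0, \ldots, G_5$ in order and matches the hand calculation.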
The forward definition of the return,
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots + \gamma^{T-t-1} R_T$,
leads to the same values.
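As a cross-check, the forward sum can be evaluated directly for each $t$. This is only a sketch under the same assumptions as before: `return_forward` is a hypothetical helper name, `rewards[k-1]` holds $R_k$, and $R_1 = -1$ is taken from the book's statement of the exercise.

```python
def return_forward(rewards, gamma, t):
    """Compute G_t = sum over k = t+1..T of gamma^(k-t-1) * R_k.

    rewards[k-1] is assumed to hold R_k, so len(rewards) == T.
    For t == T the sum is empty and the result is 0.
    """
    T = len(rewards)
    return sum(gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, T + 1))


rewards = [-1, 2, 6, 3, 2]       # R_1..R_5 (R_1 = -1 is an assumption)
print([return_forward(rewards, 0.5, t) for t in range(len(rewards) + 1)])
```

For $t = 0$ the terms are $-1 + 1 + 1.5 + 0.375 + 0.125 = 2$, and the remaining entries come out as 6, 8, 4, 2, 0, which agrees with the backward recursion.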
Have I misunderstood something, or is the answer attributed to Sutton wrong? I would appreciate any pointers from fellow readers. Many thanks!