
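Steps 2 and 5 of the derivation below both use the law of total expectation, \(\mathbf{E}X=\mathbf{E}\left(\mathbf{E}\left[X\mid Y\right]\right)\), applied in its conditional form: given \(S_t=s\) in step 2, and given \(S_t=s, A_t=a\) in step 5. Step 9 uses the Markov property: once \(S_{t+1}=s'\) is known, the rewards from \(R_{t+2}\) onward no longer depend on \(S_t\) and \(A_t\).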

\[\begin{split}
\upsilon_{\pi}(s)&=\mathbf{E}_{\pi}\left[G_{t}\mid S_{t}=s\right] \\
&=\mathbf{E}_{\pi}\left[\mathbf{E}_{\pi}\left[G_t\mid S_t=s,A_t\right]\mid S_t=s\right] \\
&=\sum_a\Pr(A_t=a\mid S_t=s)\,\mathbf{E}_{\pi}\left[G_t\mid S_t=s,A_t=a\right] \\
&=\sum_a\pi(a\mid s)\,\mathbf{E}_{\pi}\left[G_t\mid S_t=s,A_t=a\right] \\
&=\sum_a\pi(a\mid s)\,\mathbf{E}_{\pi}\left[\mathbf{E}_{\pi}\left[G_t\mid S_t=s,A_t=a,S_{t+1}\right]\mid S_t=s,A_t=a\right] \\
&=\sum_a\pi(a\mid s)\sum_{s'}\Pr(S_{t+1}=s'\mid S_t=s,A_t=a)\,\mathbf{E}_{\pi}\left[G_t\mid S_t=s,A_t=a,S_{t+1}=s'\right] \\
&=\sum_a\pi(a\mid s)\sum_{s'}p(s'\mid s,a)\,\mathbf{E}_{\pi}\left[G_t\mid S_t=s,A_t=a,S_{t+1}=s'\right] \\
&=\sum_{a}\pi(a\mid s)\sum_{s'}p(s'\mid s,a)\,\mathbf{E}_{\pi}\left[R_{t+1}+\gamma\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+2}\mid S_{t}=s,A_{t}=a,S_{t+1}=s'\right] \\
&=\sum_{a}\pi(a\mid s)\sum_{s'}p(s'\mid s,a)\left[r(s,a,s')+\gamma\,\mathbf{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+2}\mid S_{t+1}=s'\right]\right] \\
&=\sum_{a}\pi(a\mid s)\sum_{s'}p(s'\mid s,a)\left[r(s,a,s')+\gamma\upsilon_{\pi}(s')\right]
\end{split}\]
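
As a rough numerical sanity check of the final line, the sketch below evaluates a fixed policy on a small MDP in two ways: by repeatedly applying the right-hand side of the last equation, and by solving the equivalent linear system directly. The two-state, two-action MDP, its numbers, and the function names (`bellman_backup`, `evaluate_policy`, `solve_directly`) are all made up for illustration and do not come from the text above.

```python
# Hypothetical 2-state, 2-action MDP; every number here is an illustrative assumption.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# pi[s][a]: probability of choosing action a in state s
pi = [[0.5, 0.5],
      [0.2, 0.8]]

# p[s][a][s2]: probability of reaching s2 after taking action a in state s
p = [[[0.7, 0.3], [0.1, 0.9]],
     [[0.4, 0.6], [0.5, 0.5]]]

# r[s][a][s2]: expected reward for the transition (s, a, s2)
r = [[[1.0, 0.0], [0.0, 2.0]],
     [[-1.0, 0.5], [0.0, 1.0]]]


def bellman_backup(v):
    """Apply the right-hand side of the last line of the derivation once."""
    return [
        sum(pi[s][a] * sum(p[s][a][s2] * (r[s][a][s2] + GAMMA * v[s2])
                           for s2 in STATES)
            for a in ACTIONS)
        for s in STATES
    ]


def evaluate_policy(tol=1e-12, max_iter=10_000):
    """Iterate the backup until the value estimates stop changing."""
    v = [0.0 for _ in STATES]
    for _ in range(max_iter):
        new_v = bellman_backup(v)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


def solve_directly():
    """Solve (I - gamma * P_pi) v = r_pi for two states using Cramer's rule."""
    # Policy-averaged transition matrix and expected one-step reward.
    P = [[sum(pi[s][a] * p[s][a][s2] for a in ACTIONS) for s2 in STATES]
         for s in STATES]
    R = [sum(pi[s][a] * p[s][a][s2] * r[s][a][s2]
             for a in ACTIONS for s2 in STATES)
         for s in STATES]
    a11, a12 = 1 - GAMMA * P[0][0], -GAMMA * P[0][1]
    a21, a22 = -GAMMA * P[1][0], 1 - GAMMA * P[1][1]
    det = a11 * a22 - a12 * a21
    return [(R[0] * a22 - a12 * R[1]) / det,
            (a11 * R[1] - a21 * R[0]) / det]


if __name__ == "__main__":
    print("iterative:", evaluate_policy())
    print("direct:   ", solve_directly())
```

Both routes should agree up to numerical tolerance, because the last line of the derivation is a linear fixed-point equation in \(\upsilon_{\pi}\) whose backup is a contraction when \(\gamma<1\).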
