Sets
State: $\mathcal{S}$
Action: $\mathcal{A}(s)$
Reward: $\mathcal{R}(s,a)$
Probability distribution
State transition: $p(s'|s,a)$
Reward: $p(r|s,a)$
Policy: $\pi(a|s)$
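To make these objects concrete, here is a minimal NumPy sketch of a toy MDP. Everything below is an illustrative assumption: the names `P`, `R`, `pi`, the 3-state/2-action sizes, and all numbers; the reward distribution is collapsed to its expected value $\sum_r p(r|s,a)\,r$ and stored as a table.

```python
import numpy as np

# Toy MDP with 3 states and 2 actions (all numbers are illustrative).
n_states, n_actions = 3, 2

# State transition p(s'|s,a): shape (S, A, S'); each (s, a) slice sums to 1.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 0.6, 0.4], [0.5, 0.5, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
])

# Expected reward sum_r p(r|s,a) * r, stored directly as R(s,a): shape (S, A).
R = np.array([
    [0.0, 1.0],
    [0.5, 0.0],
    [2.0, 1.5],
])

# Policy pi(a|s): shape (S, A); each row is a distribution over actions.
pi = np.array([
    [0.5, 0.5],
    [0.9, 0.1],
    [0.2, 0.8],
])

# Sanity checks: transition and policy rows are valid probability distributions.
assert np.allclose(P.sum(axis=2), 1.0)
assert np.allclose(pi.sum(axis=1), 1.0)
```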
Markov property: the next state and reward depend only on the current state and action, i.e. $p(s_{t+1}, r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(s_{t+1}, r_{t+1} \mid s_t, a_t)$
State value: $v_\pi(s)=\mathbb{E}_\pi[G_t \mid S_t=s]$
where $G_t=\sum_{k=0}^\infty \gamma^k R_{t+k+1}$
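As a quick numerical check of the return, a small sketch (the helper name `discounted_return`, the reward sequence, and $\gamma=0.9$ are all illustrative assumptions) that computes $G_t$ for one finite sampled trajectory:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * R_{t+k+1}, truncated to a finite sampled trajectory."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards observed after time t along one episode (illustrative numbers).
print(discounted_return([0.0, 1.0, 0.0, 2.0]))  # 0 + 0.9*1 + 0.81*0 + 0.729*2 = 2.358
```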
Bellman equation: $v_\pi(s)=\sum_a \pi(a|s)\Big[\underbrace{\sum_r p(r|s,a)\,r}_{\text{immediate reward}} + \underbrace{\gamma \sum_{s'} p(s'|s,a)\,v_\pi(s')}_{\text{future reward}}\Big]$
Matrix form: $v_\pi = r_\pi + \gamma P_\pi v_\pi$, where $[r_\pi]_s = \sum_a \pi(a|s)\sum_r p(r|s,a)\,r$ and $[P_\pi]_{s,s'} = \sum_a \pi(a|s)\,p(s'|s,a)$
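Putting the pieces together, the sketch below evaluates $v_\pi$ for the toy MDP above in two ways: by solving the linear system $v_\pi=(I-\gamma P_\pi)^{-1}r_\pi$ in closed form, and by iterating the Bellman update $v \leftarrow r_\pi + \gamma P_\pi v$. The arrays repeat the illustrative numbers from the earlier sketch so the block runs on its own.

```python
import numpy as np

# Toy MDP from the sketch above: p(s'|s,a), expected reward R(s,a), policy pi(a|s).
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
              [[0.0, 0.6, 0.4], [0.5, 0.5, 0.0]],
              [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]])
R = np.array([[0.0, 1.0], [0.5, 0.0], [2.0, 1.5]])
pi = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])
gamma = 0.9

# r_pi(s) = sum_a pi(a|s) R(s,a);  P_pi(s,s') = sum_a pi(a|s) p(s'|s,a).
r_pi = (pi * R).sum(axis=1)              # shape (S,)
P_pi = np.einsum('sa,sat->st', pi, P)    # shape (S, S)

# Closed-form solution of v_pi = r_pi + gamma * P_pi v_pi.
v_closed = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Iterative evaluation: repeatedly apply the Bellman operator.
v = np.zeros(3)
for _ in range(1000):
    v = r_pi + gamma * P_pi @ v

print(v_closed)
print(v)  # both estimates agree up to numerical tolerance
```

Since $\gamma<1$ and $P_\pi$ is row-stochastic, the Bellman operator is a contraction, so the iteration converges to the same fixed point as the closed-form solution.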