Mathematical Foundations of Reinforcement Learning

0. Preview

1. Basic Concepts

1.1 Markov Decision Process (MDP)

  • Sets

    • State: $\mathcal{S}$

    • Action: $\mathcal{A}(s)$

    • Reward: $\mathcal{R}(s,a)$

  • Probability distribution

    • State transition: $p(s'|s,a)$

    • Reward: $p(r|s,a)$

  • Policy: $\pi(a|s)$

  • Markov property
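
A minimal sketch of how these components fit together in code, using plain Python dictionaries for a toy two-state MDP (all state names, actions, transition and reward numbers below are illustrative placeholders, not examples from the course):

```python
# Toy two-state MDP: state set S, per-state action sets A(s),
# transition probabilities p(s'|s,a), reward probabilities p(r|s,a),
# and a policy pi(a|s). All numbers are illustrative.

states = ["s1", "s2"]                      # state set S
actions = {"s1": ["stay", "move"],         # action sets A(s)
           "s2": ["stay", "move"]}

# p(s'|s,a): keyed by (s, a), mapping next state -> probability
transition = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s2": 1.0},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "move"): {"s1": 1.0},
}

# p(r|s,a): keyed by (s, a), mapping reward value -> probability
reward = {
    ("s1", "stay"): {0.0: 1.0},
    ("s1", "move"): {1.0: 1.0},
    ("s2", "stay"): {1.0: 1.0},
    ("s2", "move"): {0.0: 1.0},
}

# pi(a|s): keyed by state, mapping action -> probability (uniform policy here)
policy = {
    "s1": {"stay": 0.5, "move": 0.5},
    "s2": {"stay": 0.5, "move": 0.5},
}
```

The Markov property shows up structurally: `transition` and `reward` are keyed only by the current pair $(s,a)$, never by earlier states or actions.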

2. State Value and Bellman Equation

2.1 State Value

  • State value: $v_\pi(s)=\mathbb{E}_\pi[G_t \mid S_t=s]$

    where $G_t=\sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
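
As a quick sanity check on the discounted return, assume (purely for illustration) that every reward equals $1$ and $\gamma=0.9$; the return is then a geometric series:

$$
G_t = \sum_{k=0}^{\infty} \gamma^k \cdot 1 = \frac{1}{1-\gamma} = \frac{1}{1-0.9} = 10 .
$$

The discount factor $\gamma \in [0,1)$ keeps this sum finite and controls how much weight distant future rewards receive.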

2.2 Bellman Equation

  • Bellman equation: $v_\pi(s)=\sum_a \pi(a|s)\Big[\underbrace{\sum_r p(r|s,a)\,r}_{\text{immediate reward}} + \underbrace{\gamma \sum_{s'} p(s'|s,a)\,v_\pi(s')}_{\text{future reward}}\Big]$

    • Matrix form: $v_\pi = r_\pi + \gamma P_\pi v_\pi$
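
Because $v_\pi = r_\pi + \gamma P_\pi v_\pi$ rearranges to $(I-\gamma P_\pi)v_\pi = r_\pi$, a small MDP can be solved for its state values directly as $v_\pi=(I-\gamma P_\pi)^{-1} r_\pi$. A minimal NumPy sketch, assuming the same toy two-state MDP and uniform policy from the sketch in Section 1.1 ($P_\pi$ and $r_\pi$ below are illustrative numbers, not from the course):

```python
import numpy as np

gamma = 0.9

# Policy-averaged transition matrix P_pi and expected immediate reward
# vector r_pi for the toy two-state MDP under the uniform policy.
P_pi = np.array([[0.5, 0.5],
                 [0.5, 0.5]])
r_pi = np.array([0.5, 0.5])

# Closed-form solution: solve (I - gamma * P_pi) v = r_pi
v_closed = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Iterative solution: repeatedly apply v <- r_pi + gamma * P_pi @ v,
# which converges to the same fixed point.
v_iter = np.zeros(2)
for _ in range(1000):
    v_iter = r_pi + gamma * P_pi @ v_iter

print(v_closed)  # approximately [5., 5.]
print(v_iter)    # approximately [5., 5.]
```

Both routes agree because the Bellman operator is a contraction for $\gamma < 1$, so the fixed point $v_\pi$ is unique.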
