Wednesday, October 31, 2018

Let's read: Sutton's RL, week 3, chap 3

For week 3, we will do:
  • Rest of Sutton & Barto Chapter 3
  • Sutton & Barto Summary of Notation
  • Sutton & Barto Section 4.1 

Optimal Policy

Define: given an MDP problem, we have a corresponding policy space $\Pi$ of all possible policies for the problem. $\Pi$ is preordered by the Pareto ordering:
\[\pi_1 \succeq \pi_2 \quad \text{iff} \quad \forall s\in S, \; v_{\pi_1}(s) \ge v_{\pi_2}(s)\]
That is, one policy is at least as good as another iff it does not worsen the expected value in any state; it is strictly better if, in addition, it improves the value in at least one state.

A policy $\pi$ is optimal iff it is a maximal element in $\Pi$ thus preordered.
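
As a tiny sanity check, here is a minimal Python sketch (my own, not from the book) of this ordering, applied to two hypothetical value functions stored as dictionaries from states to values:

    def weakly_dominates(v1, v2):
        # v1 >= v2 in the Pareto preorder: at least as good in every state.
        return all(v1[s] >= v2[s] for s in v1)

    def strictly_dominates(v1, v2):
        # Weak dominance plus a strict improvement in at least one state.
        return weakly_dominates(v1, v2) and any(v1[s] > v2[s] for s in v1)

    # Two hypothetical value functions on a two-state MDP, invented for illustration.
    v_a = {"s1": 1.0, "s2": 2.0}
    v_b = {"s1": 1.0, "s2": 1.5}
    print(weakly_dominates(v_a, v_b), strictly_dominates(v_a, v_b))  # True True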

The optimal state-value function is defined as the best that can be done by any policy at a given state (though, as we'll see, it's a bit subtler than that):
 \[v_*(s) = \max_\pi v_\pi(s)\]
and similarly for $q_*$

Thermonuclear War

It's a bit subtle because it's possible that the best policy in one situation is not the best policy in another. Imagine a game of Mutually Assured Destruction between the USA and the USSR. There are at least 4 policies for the USA: 
  1. Surrender
  2. First strike immediately at one o'clock.
  3. Promise to second strike but won't actually do it.
  4. Promise to second strike and would actually do it.
And the world has three states: Peace, War, Death. Death is very bad. War is bad. Peace is good, but only without surrender. Policy 2 leads immediately to War. Policy 4, once War starts, leads immediately to Death.
A commitment to policy 4 is optimal during Peace, since it maximizes the probability of staying in Peace and thus the expected reward.
But if it ever happens that, just by bad luck (there have been many nuclear near-misses!), the world enters War, policy 4 will cause Death, and policy 3 is thus better during War.

So in the game of Mutually Assured Destruction, there is no optimal policy that actually achieves the optimal state-value function... or is there? Only an existence theorem could settle this.

The optimal policy and its corresponding value functions

Unfortunately, Sutton & Barto are very muddy here and completely gloss over the next theorem, which they implicitly assume.

Existence Theorem of $v_*$ [Bellman, 1957?, I don't know, but I guess it's in his book Dynamic Programming]: For any finite MDP, the policy space $\Pi$, under the Pareto ordering, has a global maximum $\pi_* \in \Pi$. Further, this global maximum $\pi_*$ can be chosen to be deterministic.

Assured of that existence, we can proceed to derive its properties:
\begin{equation} \begin{split} q_*(s, a) ={} & \max_\pi q_\pi(s, a)\\ ={} & \max_\pi \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1})| S_t = s, A_t = a]\\ ={} & \max_\pi ( \mathbb{E}[R_{t+1}| S_t = s, A_t = a] + \gamma\mathbb{E}[v_\pi(S_{t+1})| S_t = s, A_t = a])\\ ={} & \mathbb{E}[R_{t+1}| S_t = s, A_t = a] + \gamma\max_\pi \mathbb{E}[v_\pi(S_{t+1})| S_t = s, A_t = a] \end{split} \end{equation}
and by the existence theorem, we have
\[\max_\pi \mathbb{E}[v_\pi(S_{t+1})| S_t = s, A_t = a] =  \mathbb{E}[v_*(S_{t+1})| S_t = s, A_t = a]  \]

Since the best policy can be chosen to be deterministic, by the One-Shot Deviation Principle the optimal policy must be locally greedy: choose the most valuable action.
$$v_*(s) = \max_{a\in A(s)} q_*(s, a)$$
so after some plugging, we get the Bellman equations for $q_*$, and similarly for $v_*$:
\begin{equation} \begin{split} q_*(s, a) ={} & \mathbb{E}[R_{t+1}+ \gamma v_*(S_{t+1}) | S_t = s, A_t = a] \\ ={} & \sum_{s', r} p(s', r|s, a) (r + \gamma v_*(s'))\\ ={} & \sum_{s', r} p(s', r|s, a) (r + \gamma \max_{a'\in A(s')} q_*(s', a')) \end{split} \end{equation}
\begin{equation} \begin{split} v_*(s) ={} &  \max_{a\in A(s)}\mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a] \\ ={} &  \max_{a\in A(s)} \sum_{s', r} p(s', r|s, a)(r + \gamma v_*(s'))  \end{split} \end{equation}
These are the Bellman optimality equations.

Theorem [Bellman again?]: For any finite MDP, the Bellman optimality equation for $v_*$ has a unique solution.

Once $v_*$ is obtained, it's easy to play optimally by playing greedily: first, calculate $q_*$ from equation (2), then at state $s$, deterministically choose \(\underset{a}{\operatorname{argmax}} q_*(s, a)\).
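
To make this concrete, here is a minimal Python sketch (mine, not the book's) that solves the Bellman optimality equation by fixed-point iteration on a tiny invented MDP and then extracts the greedy policy; the transition table P and the discount gamma are made up purely for illustration:

    # Value iteration: repeatedly apply the Bellman optimality operator
    #   v(s) <- max_a sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))
    # on a tiny hypothetical MDP, then act greedily with respect to the result.

    # P[s][a] is a list of (probability, next_state, reward) triples, invented for illustration.
    P = {
        "s1": {"stay": [(1.0, "s1", 0.0)],
               "go":   [(0.9, "s2", 1.0), (0.1, "s1", 0.0)]},
        "s2": {"stay": [(1.0, "s2", 2.0)],
               "go":   [(1.0, "s1", 0.0)]},
    }
    gamma = 0.9

    def q_from_v(v):
        # q(s, a) = sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))
        return {s: {a: sum(p * (r + gamma * v[s_next]) for p, s_next, r in outcomes)
                    for a, outcomes in actions.items()}
                for s, actions in P.items()}

    v = {s: 0.0 for s in P}
    for _ in range(1000):                       # iterate to (approximately) the fixed point
        q = q_from_v(v)
        v = {s: max(q[s].values()) for s in P}  # v(s) = max_a q(s, a)

    pi = {s: max(q[s], key=q[s].get) for s in P}  # greedy deterministic policy
    print(v, pi)

This is essentially value iteration, which Chapter 4 develops properly.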

Using the Bellman equations

If we can solve the Bellman equations of an MDP, then we are done. But it's not so easy in real life. It requires three assumptions:
  1. the environment has the Markov property
  2. we know the dynamics of the environment
  3. we can actually solve the equations
Only in toy problems are all three true. In interesting games of chance, like Risk, poker, and backgammon, only the first two are true.
[Image: Nightmare Moon playing the board game Risk (artist: rodrigues404).]
Let's solve the Bellman equation together!

Many decision-making methods can be thought of as approximately solving the Bellman optimality equations.

Exercises

3.20: Basically, 0 at the hole, -1 on the green, -2 one drive away from the green, -3 two drives away from the green, etc.
3.21: 0 at the hole, -1 on the green, -2 one putt and zero drives away from the green, -3 one putt and one drive away from the green, etc.
3.22: $\gamma = 0:$ no foresight, completely greedy, choose left. $\gamma = 0.9:$ enough foresight, choose right. $\gamma = 0.5:$ exactly apathetic.
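For a quick check (assuming the exercise's figure, where the left loop pays $+1$ on its first step and the right loop pays $+2$ on its second step), the returns from the top state are
\[ v_{\pi_\text{left}}(\text{top}) = 1 + \gamma \cdot 0 + \gamma^2 \cdot 1 + \cdots = \frac{1}{1-\gamma^2}, \qquad v_{\pi_\text{right}}(\text{top}) = 0 + \gamma \cdot 2 + \gamma^2 \cdot 0 + \cdots = \frac{2\gamma}{1-\gamma^2}, \]
which coincide exactly when $2\gamma = 1$, i.e. $\gamma = 0.5$.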
3.23: Heck no, there are five equations! Do it if you have nothing better to do.
3.24: You know the optimal policy, so just follow it and compute the expected return: \(v_* = \mathbb{E}[G_t] = 10 + 0 + 0 + 0 + 0 + 10 \gamma^5 + \cdots = \frac{10}{1-\gamma^5}\)
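As a numeric sanity check (with the gridworld's $\gamma = 0.9$), a couple of lines of Python comparing the closed form against a truncated sum:

    gamma = 0.9
    closed_form = 10 / (1 - gamma**5)                          # 10 / (1 - gamma^5)
    truncated = sum(10 * gamma**(5 * k) for k in range(200))   # 10 + 10*gamma^5 + 10*gamma^10 + ...
    print(closed_form, truncated)                              # both come out to about 24.419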
3.25, 3.26: See above.
3.27, 3.28: We said that the optimal policy is deterministic: at state $s$, deterministically choose
\[\underset{a}{\operatorname{argmax}} q_*(s, a) = \underset{a}{\operatorname{argmax}} \sum_{s', r} p(s', r|s, a) (r + \gamma v_*(s'))\]
So $\pi(a | s) = 1$ if $a$ is as above, and $0$ otherwise.
3.29: By thinking about the terms' meanings, we get
\[v_\pi(s) = \sum_a \pi(a|s) q_\pi(s, a)\]
\[q_\pi (s, a) = r(s, a) + \gamma\sum_{s'}p(s'|s,a)v_\pi(s')\]
\[v_*(s) = \max_{a\in A(s)} q_*(s, a)\]
\[q_* (s, a) = r(s, a) + \gamma\sum_{s'}p(s'|s,a)v_*(s')\]
then plug into each other to get the Bellman equations.
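For instance, substituting the second display into the first gives the Bellman equation for $v_\pi$ in terms of $r(s,a)$ and $p(s'|s,a)$, and substituting the first into the second gives the one for $q_\pi$:
\[v_\pi(s) = \sum_a \pi(a|s)\Big[r(s,a) + \gamma\sum_{s'}p(s'|s,a)\,v_\pi(s')\Big]\]
\[q_\pi(s,a) = r(s,a) + \gamma\sum_{s'}p(s'|s,a)\sum_{a'}\pi(a'|s')\,q_\pi(s',a')\]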
